1. Summary

This  document  summarizes  work  completed  by  CRCL  Inc  (the  Center  for  Research  in  Computational 
Linguistics,  a  US  501(c)3  nonprofit  organization)  in  the  period  July  1  2015  -  December  31  2016  as  part 
of  the  DARPA  LORELEI  project,  contract  number  HR0011-15-C-0117.  It  also  provides  an  overall 
view  of  LORELEI,  and  our  role  in  it. 

The  LORELEI  program  intends  to  advance  the  state  of  computational  linguistics  and  human 
language  technology,  enabling  rapid,  low-cost  development  of  capabilities  for  low-resource  languages. 
These  will  provide  situational  awareness  based  on  information  from  any  language,  supporting  emergent 
missions  such  as  humanitarian  assistance/disaster  relief,  peacekeeping,  or  infectious  disease  response. 

LORELEI  Technical  Area  1  addresses  the  core  research  challenge  of  rapidly  developing  language 
processing  tools  for  a  language  without  reliance  on  large  corpora  or  extensive  human  annotation  efforts. 
TA1.1  focuses  on  research  and  development  of  novel  techniques  to  discover  and  use  “universal” 
properties  and  (typological  or  other)  regularities  of  languages,  reducing  reliance  on  huge  quantities  of 
language-specific information for translation, information extraction, or other language technologies.
This  research  area  builds  on  knowledge  of  the  characteristic  tendencies  and  regularities  of  human 
language,  but  is  not  limited  to  “absolute”  universals  that  apply  to  every  known  language. 

As  a  TA1.1  performer,  CRCL’s  task  (as  outlined  in  the  Statement  of  Work  and  listed  as  a  series  of 
deliverable  milestones)  was  to: 

•  deliver  cleaned,  normalized,  curated  lexical  data  and  cognate  groupings  for  200-250  distinct 
languages  (following  ISO  639-3)  per  year. 

We  also  pursued  two  general  activities  on  behalf  of  the  program: 

•  discover  and  implement  means  of  analyzing  and  enriching  the  data  sets, 

•  interact  with  other  performers  to  help  define  and  enable  downstream  applications. 

These  involved  enhancing  and  devising  applications  for  our  (and  other)  small  lexicons.  CRCL  was 
retained  under  a  one-year  contract,  and  an  additional  six-month  extension.  All  project  results  are 
available  for  re-use  under  a  Creative  Commons  4.0  license. 

Problem  Description: 

The  U.S.  government  does  not  have  hard,  language-by-language  content  data,  which  might  support 
action  or  planning,  for  more  than  a  fraction  of  the  world's  7,000  languages.  Existing  typological 
descriptions  (e.g.  WALS)  are  sparse,  phonological  data  (e.g.  PHOIBLE,  with  «25%  coverage)  is 
limited,  and  denotational  descriptions  (e.g.  the  single-source  Ethnologue)  do  not  include  or  reference 
documentary  data.  This  resource  gap  affects  both  practical  operational  concerns  -  providing  actors  on 
the  ground  with  “human  intelligence”  regarding  speaker  communities  -  and  long-term  strategic 
technology  planning  for  language-engineering  tools:  we  can't  issue  a  challenge  to  develop  new  tools  for 
small-footprint,  low-density  languages  without  gold-standard  resources  to  assess  their  results. 

Weighing  cost,  availability,  and  linguistic  value,  the  only  universally  representative,  fine-grained 
resource  we  might  plausibly  assemble  must  be  based  on  the  relatively  small  lexicons  (<2,500  words) 
typically  gathered  for  comparative,  survey,  or  linguistic  research  purposes.  We  are  assembling  and 
improving  this  resource  for  the  five  linguistic  families  (totaling  about  2,000  languages)  that  dominate 
the  Asia-Pacific  region  (28  of  the  36  USPACOM  countries).  This  region  includes  7  of  10  “global 
hotspots  of  disaster  risk”  (World  Risk  Report  2013),  and  has  potential  for  future  conflict  in  restive  areas 
of  Myanmar,  South  China,  Northeast  India,  and  insular  Southeast  Asia. 

Expected  Impact: 

Paradoxically,  the  best  known  /  most  successful  languages  (e.g.  Thai  or  Vietnamese,  for  which  we  have 
the  most  resources)  are  usually  poor  representatives  of  the  family  as  a  whole.  As  the  only  LORELEI 
project  focused  on  assembling  fine-grained  language  datasets,  we  contribute  to  several  core  problems: 
- identifying training/translation pivot languages: languages are related to one another both by
inheritance (they share a common ancestor) and by contact (one borrows from the other, or both
borrow from languages in common). Using the techniques of comparative and historical linguistics and
dialectometry will let us suggest which language best represents a given group, and would produce the
best results when adapted to a low-density incident language.

- identifying low-density/small footprint languages: a low-density language has few computational or
analytical  resources;  a  small  footprint  language  is  difficult  to  even  find  data  for.  This  means  that  it  may 
be  difficult  to  even  identify  the  language  of  a  potentially  important  audio  or  text  sample.  We  provide  at 
least  small  lexicons  for  (ideally)  half  the  languages  in  the  region;  these  may  be  the  only  formal  resources 
available  for  language  identification. 

-  predicting  high-value  investment  languages:  rather  than  scrambling  to  back-fit  existing  resources  to 
incident languages, we propose that four factors will help predict which languages are worth investing in now:
a)  linguistic  centrality,  b)  currently  available  resource  base,  c)  speaker  population,  and  d)  risk  history. 

-  producing  gold-standard  sets  of  normalized  lexical  data  and  cognate  assignments.  This  provides 
ground-truth  data  for  future  research  on  rapid  development  or  adaptation  of  tools  and  resources  for  low- 
density  languages. 

Research  Goals: 

Specific  goals:  the  project  will  extend  and  apply  CRCL  technology  required  to  normalize  phonological 
transcription  and  semantic  glossing  of  a  very  large  number  of  lexicons  (donated  to  the  project  in 
electronic  form  by  CRCL)  -  and  to  identify  large  numbers  of  related  cognate  words,  which  help  refine 
our  understanding  of  (and  predictive  capacity  for)  variation  between  related  languages.  Our  deliverable 
is  the  finished  product:  normalized  lexicons  and  marked  cognate  sets. 

Performance  improvements:  very  few  of  the  world's  languages  can  provide  enough  electronic  data  (e.g. 
via  Web  pages  or  social  media)  to  support  current  computational  approaches  to  language  modeling.  We 
will  provide  the  hard  data  required  to  produce  phonological  models,  infer  etymological  and  loan 
relationships,  predict  word  forms  (e.g.  for  entity  recognition),  and  to  support  unknown  language 
identification. 

New  capabilities:  In  narrower  terms,  the  project  makes  it  possible  to: 

-  extract  “phonodynamic”  language  models;  that  is,  phonological  and  phonotactic  sketches  whose 
elements  can  be  weighted  against  the  lexicon  for  frequency,  functional  load,  salience,  phonological 
neighborhood  characteristics,  and  so  on.  This  is  the  type  of  information  that  helps  humans  nearly 
instantly  identify  even  languages  they  do  not  speak. 

-  identify  shibboleths;  that  is,  simple  words  from  two  or  more  languages  that  do  not  resemble  each  other 
phonologically,  and  can  be  used  to  help  identify  speaker  language. 

-  show  the  linguistic  ground  path  of  an  expected  event;  that  is,  identify  the  speaker  communities  that  are 
predicted to be in the path of a typhoon, tsunami, epidemic, or other disaster.

-  show  the  human  terrain  of  an  ongoing  event;  that  is,  identify  the  speaker  communities  within  the 
known  bounds  of  an  ongoing  political  or  natural  crisis. 

-  build  tools  for  automated  orthography-to-phonology;  e.g.  for  generating  phonological  transcription  of 
L2  dictionaries  or  texts. 

- while project data is at arm's length from current MT applications, it is reasonable to expect that
regular  sound-change  models  will  support  some  named  entity  identification. 

- while project data is at arm's length from current speech-to-text applications, it is likely to support
basic  functionality  like  word  boundary  recognition. 
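
The shibboleth idea in the list above can be sketched in a few lines. This is only an illustration under stated assumptions, not the project's method: it compares plain character strings rather than analyzed phonological segments, it uses normalized Levenshtein distance as a stand-in for phonological dissimilarity, and the two small lexicons, their lect labels, and the 0.6 threshold are invented toy values rather than project data.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def shibboleth_candidates(lex_a: dict, lex_b: dict, threshold: float = 0.6) -> list:
    """Return (gloss, form_a, form_b, distance) for shared glosses whose forms
    differ by at least `threshold` (normalized distance), most distinct first."""
    rows = []
    for gloss in sorted(lex_a.keys() & lex_b.keys()):
        form_a, form_b = lex_a[gloss], lex_b[gloss]
        dist = levenshtein(form_a, form_b) / max(len(form_a), len(form_b))
        if dist >= threshold:
            rows.append((gloss, form_a, form_b, dist))
    return sorted(rows, key=lambda row: -row[3])


# Invented toy lexicons (hypothetical forms, for illustration only).
lect_a = {"water": "nam", "fire": "fai", "dog": "ma"}
lect_b = {"water": "nam", "fire": "api", "dog": "asu"}

candidates = shibboleth_candidates(lect_a, lect_b)
for gloss, form_a, form_b, dist in candidates:
    print(f"{gloss}: {form_a} vs {form_b} (distance {dist:.2f})")
```

A fuller version would presumably compare segmented, normalized forms and weight substitutions by phonetic similarity, so that near-identical sounds do not count as full mismatches.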
2. Introduction

Problem  description 

The  U.S.  responds  to  global  emergencies  of  all  types.  Doing  this  effectively,  safely,  and  efficiently 
relies  on  local,  non-English  sources  of  information.  But  while  there  are  an  estimated  7,000  world 
languages,  technology  for  automated  translation,  summarization,  sentiment  assessment  and  the  like  is 
only  available  for  a  tiny  percentage  -  perhaps  350  (5%)  of  them.  It  is  possible  to  develop  such  resources 
language  by  language,  but  that  is  a  slow  and  expensive  process,  estimated  at  $10,000,000  each. 

Most  people  speak  more  than  one  language;  perhaps  by  choice  in  the  developed  world,  but  as  a 
matter  of  necessity  in  the  developing  world,  where  one’s  mother  tongue  is  usually  not  the  language  of 
education and government. Even though English and the other well-provisioned languages are near-universal
linguae francae in times of peace (and when people wish to be understood), in times of
emergency  or  conflict  (and  when  people  do  not  necessarily  want  to  be  understood)  the  smaller  languages 
become  increasingly  important. 

Code  switching  -  slipping  into  a  second  language  in  the  course  of  written  or  spoken  discourse  -  is 
well-understood  not  only  as  a  means  of  concealing  information,  but  as  a  marker  of  information  that  is 
especially  urgent  or  meaningful.  Even  if  translation  technology  or  a  detailed  language  description  is  not 
available,  the  simple  ability  to  identify  any  and  every  language  is  an  important  tool.  Consider  countries 
like  Indonesia,  Myanmar,  the  Philippines,  and  China  (with  700,  117,  200,  and  300  languages, 
respectively).  We  can  readily  acquire  Twitter  feeds  and  on-line  messaging,  but  messages  in,  or  mixed 
with,  most  minority  languages  will  be  discarded  simply  because  we  cannot  classify  them. 

Surprisingly,  perhaps,  we  do  not  have  digital  language  models,  printed  descriptions,  or  reference 
samples for most languages. Many websites that purport to provide language documentation on a global
scale generally draw from a handful of sources, such as Ethnologue or Wikipedia, which themselves
supply only bare details. The depth of coverage available falls off rapidly, even from sites (World Atlas
of Language Structures) that are widely cited in the literature. And the type of information provided
may  be  of  interest  for  linguistic  purposes,  but  of  little  value  for  computational  linguistic  applications; 
e.g.  a  phonological  sketch  that  contains  only  a  list  of  phonemes,  without  any  frequency  or  phonotactic 
detail. 

Traditionally,  language  technology  efforts  have  worked  from  the  top  down,  beginning  by  developing 
resources  and  tools  for  the  largest  languages  (such  as  English,  Chinese,  various  European  languages), 
and  gradually  trickling  down  to  smaller  languages.  LORELEI’s  predecessor,  the  REFLEX  LCTL 
project,  attempted  to  extend  and  accelerate  this  process  (and  produced  resources  used  in  LORELEI). 
While  the  idea  of  using  interlingua  pivot  languages  is  not  new,  applications  have  been  limited. 

LORELEI attempts to change this equation by recognizing that neither language-specific translation
technology nor extensive language resources are necessarily required to obtain actionable information.
At the extreme, a “peephole” view into communications is all that is required to recognize disaster-related
words  or  sentiment.  The  challenge  is  not  to  translate  all  messages,  but  rather  to  recognize  high-value 
messages  or  messaging. 

The  Leveraging  Small-Lexicon  Language  Models  project 

CRCL’s contribution begins with the broad question: can small lexicons help solve big language
problems? Can minimal, but fine-grained, phonological and lexical data make a useful contribution to
both  regional  and  global  understanding  of  language  universals  and  interaction?  Is  it  even  possible  to 
develop  such  data  resources  so  quickly? 

CRCL  proposed  to  provide  small  lexicons  for  200-250  distinct  Asia-Pacific  ISO  639-3  codes  per 
year.  Although  LORELEI’s  scope  is  world-wide,  we  chose  to  focus  on  Asia-Pacific  for  a  variety  of 
reasons,  the  main  one  being  that  it  was  the  largest  possible  region  in  terms  of  languages  that  could 
reasonably  be  managed  within  the  confines  of  the  program. 
complexity: extremely high language density     Indonesia: 700, China: 300, Philippines: 200, Malaysia: 146, Nepal: 125, Myanmar: 117 ...
history: “global hotspots of disaster risk”     7 of 10 highest-risk countries are in Asia-Pacific
risk: likely regions of future conflict         “highland” populations, borders within borders
responsibility: US Pacific Command region       36 countries, 3,000 languages - 28 / 2000 in our defined area
infrastructure: a TIPSTER moment                providing linguistic data for the long tail of least-resourced languages

Table 1 Motivations for “Small Lexicon” project design.


The  Asia-Pacific  region  is  home  to  some  3,000  languages:  more  than  40%  of  the  world’s  total.  Our 
mission  has  been  to  define,  and  then  deliver,  the  resources  that  will  have  the  largest  impact  on  language 
understanding.  To  do  this  we  have  focused  on: 

•  five  language  families  that  account  for  some  2,000  languages,  and  blanket  nearly  the  entire 
region  from  the  Himalayas  to  the  South  Pacific  (excluding  Australia  and  parts  of  New  Guinea), 

•  small  lexicons,  typically  ranging  from  500  -  2,500  words,  that  were  assembled  for  language 
survey,  sketch,  and/or  comparative  research. 

These  are  typically  high-quality  resources  that  provide  detailed  phonological  transcription  of  all  items. 
We  chose  to  focus  on  small  lexicons  for  two  reasons: 

•  they  are  the  only  nearly  universal  resource  available, 

•  although  they  only  supply  a  modest  amount  of  translation,  comparative  and  survey  lexicons  are 
the  ideal  minimal  resource  for  language  modeling. 

Project details

In the context of our project:

•  data  is  almost  invariably  received  in  phonological  transcription,  not  formal  orthography, 

•  nearly  all  data  has  been  previously  published  (or  collected  and  not  published  for  one  reason  or 
another).  We  are  not  eliciting  new  data,  or  transcribing  existing  field  tapes. 

•  lists  were  elicited  as  part  of  field  or  comparative  surveys,  usually  by  trained  linguists,  and  are  of 
objectively  high  quality,  especially  in  contrast  to  typical  “found”  data  sources,  however  ... 

•  lists  are  often  sui  generis,  not  based  on  text  corpora,  or  supported  by  other  reference  resources; 
hence,  it  is  not  always  possible  to  confirm  our  interpretation  of  the  authors’  intent, 

• original lists usually have <2,500 items. Some survey lists will be shorter; many SIL surveys
track ~400-500 items, and in some areas 200+ item Swadesh-style lists are all that are available,

•  items  are  usually  glossed  with  a  single  sense  -  not  defined  with  multiple  senses, 

•  lemma  forms  are  most  common,  compounds  and  complex  morphology  less  so.  With  a  few 
exceptions,  only  Austronesian  (AN)  languages  regularly  have  inflectional  morphology;  particles 
and  auxiliaries  are  common  in  the  other  families. 

•  some  sources  may  mark  morphological  boundaries;  these  marks  are  passed  through  in  the  raw 
forms,  but  not  in  the  normalized  forms. 

In  the  18  months  of  our  project  we  focused  on  a  relatively  small  number  of  sources  that  provide  broad 
geographic  and  phylogenetic  coverage,  and  raise  a  wide  sample  of  typological  and  notational  issues. 

Processed  CRCL  datasets  are  assembled  by  running  raw  inputs  through  a  software  system  that  is 
frequently  tweaked  and  rebuilt.  Datasets  are  provided  both  as  single  aggregated  files  (one  per  resource 
type),  and  as  many  files  distributed  in  a  family  /  ISO  /  lect  directory  hierarchy  in  XML  and  TSV  formats. 
CRCL had three primary tasks in acquiring and working with raw lexicon data:

• add a layer of normalized glosses we call MetaGlosses; usually numbered WordNet 3.0 senses.

Metagloss semantics index words that are etymologically related, and whose raw glosses differ only
by an author’s choice of vocabulary or phrasing: rock versus stone. The metagloss will not greatly
diverge from the raw gloss even if etymological grouping might call for it. But, in moderately
ambiguous situations (cloudy versus gloomy) we favor the more common term. Most metaglosses are
WordNet  3.0  senses  [Miller  1995],  extended  when  necessary  to  fill  English-language  lexical  gaps,  or  to 
allow  consistent  handling  of  categories  like  kin  terms.  On  occasion,  insight  gained  from  downstream 
cognate  grouping  may  prompt  revision  of  a  sense  assignment. 

•  add  a  layer  of  normalized  phonological  forms  we  call  MetaForms. 

Metaforms  generally  make  unambiguous  substitutions  that  transform  ad  hoc  notations  to  (usually) 
standard  IPA  notations.  As  with  metaglosses,  situations  arise  in  which  raw  forms  must  be  slightly 
reinterpreted  to  achieve  the  consistency  downstream  applications  require.  Normalized  metaforms  are 
analyzed  into  syllables,  sub-syllabic  components,  and  individual  phonological  segments.  Our  goals  are 
utilitarian,  rather  than  theoretical:  to  reveal,  measure,  and  if  possible  extend  the  form’s  usefulness  for 
language  identification,  lexicon  extension,  cognate  identification,  audio  segmentation  or  transcription, 
and  similar  applications.  As  with  glossing,  transcriptions  may  occasionally  be  revised  with  the  benefit 
of  information  from  cognate  grouping. 

•  group  etymologically  related  forms  into  cognate  sets  we  call  EtySets. 

When  possible  we  seek  support  and  guidance  from  comparative  sets  and  proto-form  reconstructions 
found  in  the  literature.  We  do  not  produce  new  reconstructions,  or  attempt  to  discover  long-range 
etymological  relations.  We  anticipate  that  the  primary  use  of  our  cognate  sets  will  be  to  support 
applications  like  lexicon  extension,  drawing  on  evidence  from  predictable,  regular  phonological 
variation  between  relatively  closely  related  languages. 
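
The metaform normalization step described above (unambiguous substitutions from an ad hoc notation to a standard one) might be sketched as follows. This is a minimal illustration, not CRCL's software: the two rules (a source-specific tone-number table and a doubled-vowel length rule) are assumptions chosen only so that the raw form in the Figure 1 sample, read here as `hun6 thii1`, maps to a normalized form read as `hun55 thi:14`. Both readings of the scanned figure, and every name in the code, are hypothetical.

```python
import re

# Hypothetical, source-specific rules (assumptions for illustration only).
TONE_MAP = {"6": "55", "1": "14"}       # source tone category -> tone contour digits
LONG_VOWEL = re.compile(r"([aeiou])\1")  # doubled vowel marks length in the raw notation


def normalize_syllable(raw: str) -> str:
    """Rewrite one raw syllable such as 'thii1' into a normalized form such as 'thi:14'."""
    match = re.fullmatch(r"(\D+)(\d)", raw)
    if match is None:
        raise ValueError(f"unexpected syllable shape: {raw!r}")
    body, tone = match.groups()
    if tone not in TONE_MAP:
        raise ValueError(f"no tone rule for {tone!r} in {raw!r}")
    body = LONG_VOWEL.sub(r"\1:", body)  # 'ii' -> 'i:'
    return body + TONE_MAP[tone]


def normalize_form(raw: str) -> str:
    """Normalize a whitespace-separated sequence of raw syllables."""
    return " ".join(normalize_syllable(syllable) for syllable in raw.split())


normalized = normalize_form("hun6 thii1")
print(normalized)
```

Raising an error on any syllable the rules do not cover, rather than passing it through, reflects the requirement that substitutions be unambiguous: forms that need reinterpretation are better flagged for a linguist than silently guessed.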


cognate set reference
language metadata
gloss data - silver item is WN 3.0
form data
brief form analysis
detailed form analysis: we show syllable structure and sub-structure, along with positional (phonotactic) information

<entry id="hudak2008comparative:C:cll.rl51a.gl51.i2391">
  <family>KD</family>
  <cogset>KD:H151</cogset>
  <iso>nut</iso>
  <language>Nung (Viet Nam)</language>
  <dialect>Western</dialect>
  <latLong>22.1166, 105.5255</latLong>
  <country>Viet Nam</country>
  <adm level="1">Tinh Bac Kan</adm>
  <gloss status="copper">hammer</gloss>
  <gloss status="silver">hammer#n#2</gloss>
  <form status="copper">hun6 thii1</form>
  <form status="silver">hun55 thi:14</form>
  <form status="silver" style="tokenized">:.h.:|u...|.n.:|55 :.th.:|i:...||14</form>
  <form status="silver" style="segmented">h.u.n.55 th.i:.14</form>
  <tokens>
    <syllable canon="CV">
      <onset><core>h</core></onset>
      <nucleus><core pos="3.1">u</core></nucleus>
      <coda><core>n</core></coda>
      <tone>55</tone>
    </syllable>
    <space />
    <syllable canon="CV">
      <onset><core>th</core></onset>
      <nucleus><core pos="3.1">i:</core></nucleus>
      <tone>14</tone>
    </syllable>
  </tokens>
</entry>


Figure  1  A  sample  of  the  information  that  accompanies  each  of  the  850,000+  delivered  lexical  items. 




A  typical  result  item  is  shown  in  Figure  1.  This  “easy  to  use”  XML  build  (from  a  lexicon.xml  file) 
bakes  in  source  and  language  metadata,  shows  both  raw  (“copper”)  and  normalized  (“silver”)  versions 
of  the  gloss  and  form,  and  includes  a  brief  and  detailed  phonological  analysis  of  the  normalized  form. 
This  layout  can  be  modified  if  desired. 
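
As a rough illustration of how such an entry might be consumed, the sketch below parses an abbreviated entry with Python's standard `xml.etree.ElementTree` module and pulls out the normalized gloss, the normalized form, and the per-syllable analysis. The element and attribute names follow Figure 1; the embedded sample is a hand-cleaned, shortened reading of that figure (an assumption wherever the scan is unclear), the `id` value is a placeholder, and the `summarize` helper is hypothetical rather than part of the delivered tools.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hand-cleaned reading of the Figure 1 entry (placeholder id).
SAMPLE = """<entry id="example">
  <family>KD</family>
  <cogset>KD:H151</cogset>
  <iso>nut</iso>
  <gloss status="copper">hammer</gloss>
  <gloss status="silver">hammer#n#2</gloss>
  <form status="copper">hun6 thii1</form>
  <form status="silver">hun55 thi:14</form>
  <tokens>
    <syllable canon="CV">
      <onset><core>h</core></onset>
      <nucleus><core pos="3.1">u</core></nucleus>
      <coda><core>n</core></coda>
      <tone>55</tone>
    </syllable>
    <syllable canon="CV">
      <onset><core>th</core></onset>
      <nucleus><core pos="3.1">i:</core></nucleus>
      <tone>14</tone>
    </syllable>
  </tokens>
</entry>"""


def summarize(entry: ET.Element) -> dict:
    """Collect the normalized ("silver") gloss and form plus a syllable breakdown."""

    def pick(tag: str, status: str):
        # Take the first element with this status that has no extra style variant.
        for element in entry.findall(tag):
            if element.get("status") == status and element.get("style") is None:
                return (element.text or "").strip()
        return None

    syllables = []
    for syllable in entry.iter("syllable"):
        parts = {}
        for part in syllable:
            # onset/nucleus/coda wrap a <core>; <tone> holds its text directly.
            parts[part.tag] = part.findtext("core", default=part.text)
        syllables.append(parts)

    return {
        "iso": entry.findtext("iso"),
        "cogset": entry.findtext("cogset"),
        "gloss": pick("gloss", "silver"),
        "form": pick("form", "silver"),
        "syllables": syllables,
    }


summary = summarize(ET.fromstring(SAMPLE))
print(summary["iso"], summary["gloss"], summary["form"])
```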

This  is  the  type  of  data  required  for  quantitative  and  comparative  methods  of  inference  of  trees  of 
inherited  phylogenetic  relations,  and  graphs  of  loan  relations.  It  lets  us  address  the  following  kinds  of 
questions  (although  implementing  these  was  beyond  CRCL’s  project  scope): 

•  given  a  basic  (200-2,500  words)  incident  language  lexicon  in  its  areal  context,  can  we  infer 
enough  details  of  phonology  and  morphology  to  extend  functional  vocabulary  using  non-incident 
language  resources? 

•  given  text  from  a  low-resource  incident  language,  can  we  use  a  basic  lexicon,  a  language  model 
at  least  partially  obtained  from  it,  and  one  or  more  pivot  languages  to  enable  translation,  named- 
entity  recognition,  or  other  situational  understanding? 

•  can  minimal,  but  fine-grained,  phonological  and  lexical  data  make  a  useful  contribution  to  both 
regional  and  global  understanding  of  language  universals  and  interaction? 

The  project  raised  many  other  questions  and  possibilities  as  well: 

•  how  much  information  does  automatic  transcription  require? 

•  how  well  do  wordlists  enable  phonemic,  phonotactic,  and  morphological  language  modeling? 

•  how  well  does  the  lexicon  reflect  an  open  corpus  for  these  distributions? 

• can we anticipate characteristics of difficult-to-obtain corpora; e.g. non-orthographic languages
whose only written appearance is in unmarked informal social media?

•  how  small  a  dataset  will  still  produce  a  useful  language  model? 

•  can  we  devise  stopping  rules  for  minimally  useful  sample  sizes?  Can  we  tell  when  we  have 
enough? 

•  what  types  of  information  can  be  meaningfully  aggregated  between  small  language  samples? 
When  can  we  define  clusters  of  related  languages  for  which  this  is  appropriate? 

•  how  many  cognate  pairs  are  required  to  induce  enough  parent  proto-forms  -  implicitly,  regular 
rules  for  sound-change  or  morphological  variation  -  to  accurately  remodel  existing  data? 
3. Methods, Assumptions, and Procedures

Data  sources  and  grades 

We  rely  primarily  on  published  materials,  although  in  some  cases,  linguists  will  share  unpublished  texts 
or  data.  While  born-digital  publication  and  distribution  has  become  more  common  in  the  past  few  years, 
most  sources  are  traditionally  printed  (or  in  the  case  of  some  unpublished  field  notes,  handwritten). 
Nearly  all  of  these  resources  provide  transcribed  forms  and  glosses,  and  were  elicited  for  language 
survey,  sketch,  or  comparative  research  applications.  Use  of  ordinary  dictionaries  is  uncommon. 


papers           MKSJ, LTBA, NUSA, JSEALS, OL, PL, other
theses           world-wide, including many Thai, Chinese, other
surveys          may cover closely related lects; e.g. Myanmar
sketches         particularly extensive in Southern China
gray literature  informally published, not widely distributed
field notes      often unpublished / only available source
comparative      Shorto, Blust, Sidwell, Ratliff, Matisoff, Gedney, other
e-resources      MKLP, STEDT, ACD, ABVD
extent           ideally 2,500, but Swadesh if necessary
quality          best available resource, but mileage will vary

Table 2 Typical data sources and characteristics.


We  use  the  rough  nomenclature  in  Table  3  to  describe  data. 


vapor    we’ve heard of it, but haven’t seen it
water    untranscribed audio only
paper    paper or pdf, not transcribed or extracted
tin      dictionary e-data: orthography and definitions
copper   comparative / survey e-data: forms and glosses
bronze   some vanilla algorithms; naive normalization of forms / glosses, some cognate sets
silver   customized machine processing, machine-usable, but not verified
gold     human-verified, machine-usable, comparable datasets

Table 3 Informal nomenclature used to describe data quality. Our “silver” is in fact linguist-verified and “good as gold” for
all practical purposes - we are delaying “gold” assignment until the sets are rolled out to the wider linguistic community.

CRCL  brings  all  copper-standard  data  to  the  program:  data  transcribed  as-is,  provided  in  Unicode,  with 
nothing  beyond  incidental  normalization. 

3a.  Comparative  coverage 

A  number  of  open-access  databases  provide  linguistic  data,  but  their  coverage  of  the  Asia-Pacific  region 
tends  to  be  limited  in  breadth  (few  languages  are  covered)  and/or  depth  (coverage  is  superficial).  This 
comparison  was  conducted  in  May,  2015,  and  relies  on  family  grouping  of  ISO  codes  per  Ethnologue  18 
[Lewis  2016]  (results  from  Glottolog  [Hammarstrom  2016]  would  be  very  similar),  or  the  sources’  own 
internally  reported  grouping  (helpful  for  WALS  [Dryer  2013],  which  does  not  always  map  its  data  to 
ISO  639-3  codes). 
Linguistic     ISO      CRCL  CRCL  WALS    WALS³    PanLex⁴  PanLex  ASJP⁵   PHOIBLE⁶  WPD⁷
Data           639-3¹   Y1    Y4²   (2679)  >25/10%  (5963)   >200    (4401)  (2105)

Austronesian   1257     109   626   325     42/160   1060     391     805     42        718
Austroasiatic  170      30    85    47      9/23     125      20      93      43        90
Hmong-Mien     38       19    19    5       1/3      21       5       15      3         15
Kra-Dai        95       24    48    17      3/7      69       9       48      12        33
Sino-Tibetan   474      108   242   146     21/87    245      24      165     70        208
Total          2034     290   1017  540     76/280   1520     449     1126    170       1058

1  ISO  item  counts  are  based  on  the  Ethnologue  18  analysis.  There  are  very  small  inconsistencies  in  all  counts  shown  because 
additions,  deletions,  and  modifications  to  ISO  639-3  are  not  always  migrated  to  the  sources,  or  because  there  was 
uncertainty  or  disagreement  about  language  identification. 

2  Figures  in  the  Y4  column  reflect  potential  CRCL  milestone  requirements  for  40-50%  ISO  639-3  coverage.  Actual  coverage 
of  AA/HM/KD  will  probably  be  nearly  complete. 

3 These figures show depth of coverage. WALS has 194 feature categories; we list the number of WALS datasets that have
data for at least 25% and 10% of the WALS feature set.

4  The  PanLex  [Kamholz  2014]  sets  in  Asia-Pacific  are  predominantly  very  small  samples  (50%  have  fewer  than  45  items). 
Returned  sets  appear  to  be  rough  synonym  sets,  and  there  is  no  attempt  to  normalize  notation,  or  differentiate  between 
orthography  and  phonological  transcription.  Cited  figures  in  the  >200  column  count  only  the  largest  language  variety 
within  any  ISO  code  (these  figures  are  typically  inflated  by  double-counting  of  the  same  items  from  multiple  sources;  e.g. 
ASJP  and  the  ASJP  source). 

5 The ASJP [Bakker 2009] sets contain a maximum of 40 words per lect, written in a reduced phonological transcription.
They are also included in (and often provide the main data for) the PanLex distribution.

6  PHOIBLE  [Moran  2014]  provides  lists  of  phonological  segments  with  detailed  source  documentation. 

7  The  World  Phonotactics  Database  [Donohue  2013]  summarizes  phonotactic  restrictions  (e.g.  “Is  the  coda  preferentially  a 
nasal?”)  as  +/-  binary  features,  or  counts  (e.g.  “Total  vowels”).  It  does  not  provide  lexical  items  or  transcribed 
phonological  data. 

Table  4  Limited  language-family  coverage  of  currently  available  resources. 
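
The coverage arithmetic behind the CRCL columns of Table 4 can be reproduced with a short sketch. The counts are copied from the table; the helper name `pct` and the dictionary layout are illustrative assumptions. Note that the per-family Y4 figures as printed sum to 1,020, slightly above the printed total of 1,017.

```python
# (ISO 639-3 count, CRCL Y1, CRCL Y4) per family, copied from Table 4.
ROWS = {
    "Austronesian": (1257, 109, 626),
    "Austroasiatic": (170, 30, 85),
    "Hmong-Mien": (38, 19, 19),
    "Kra-Dai": (95, 24, 48),
    "Sino-Tibetan": (474, 108, 242),
}


def pct(part: int, whole: int) -> float:
    """Coverage as a percentage, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Per-family coverage of ISO 639-3 codes for the Y1 and Y4 columns.
coverage = {family: (pct(y1, iso), pct(y4, iso)) for family, (iso, y1, y4) in ROWS.items()}

iso_total = sum(iso for iso, _, _ in ROWS.values())
y1_total = sum(y1 for _, y1, _ in ROWS.values())
y4_total = sum(y4 for _, _, y4 in ROWS.values())

for family, (y1_cov, y4_cov) in coverage.items():
    print(f"{family}: Y1 {y1_cov}%, Y4 {y4_cov}%")
print(f"All families: Y1 {pct(y1_total, iso_total)}%, Y4 {pct(y4_total, iso_total)}%")
```

On these figures the Y4 column sits at roughly half of each family's ISO 639-3 codes, which matches the upper end of the 40-50% milestone range given in note 2.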


A  variety  of  projects  and  organizations  attempt  to  provide  or  find  ordinary  text  data  for  as  many 
languages  as  possible.  It  is  helpful  to  bear  in  mind,  however,  that  the  most  readily  accessible  online 
texts for low-density languages are often religious tracts. Like many low-density language Wikipedia
pages,  they  often  have  a  high  proportion  of  transliterated  names  and  toponyms  that  may  skew  language 
modeling  unless  detected. 

The An Crúbadán project supplies orthographic trigram models for language identification, as well as
word and word bigram frequencies, and links to the discovered text sources [Scannell 2007]. It is
possible  that  the  paucity  of  sources  for  Asia-Pacific  texts  is  due  to  our  inability  to  properly  seed  Web 
crawlers  for  these  texts,  or  to  accurately  identify  them  when  they  are  found. 
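
To make the idea of trigram-based language identification concrete, here is a minimal sketch. It is not the An Crúbadán implementation or file format: it builds character-trigram count profiles from two short toy sentences (English and Indonesian, written for this example and labeled with their ISO 639-3 codes) and assigns a sample to the profile with the highest cosine similarity. The function names are hypothetical, and realistic profiles would be trained on far more text.

```python
import math
from collections import Counter


def trigram_profile(text: str) -> Counter:
    """Count overlapping character trigrams, padding with spaces at both ends."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(p: Counter, q: Counter) -> float:
    """Cosine similarity between two trigram count vectors."""
    dot = sum(p[gram] * q[gram] for gram in p.keys() & q.keys())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def identify(sample: str, profiles: dict) -> str:
    """Return the label of the profile most similar to the sample."""
    sample_profile = trigram_profile(sample)
    return max(profiles, key=lambda label: cosine(sample_profile, profiles[label]))


# Toy training text (illustrative only; far too small for real use).
profiles = {
    "eng": trigram_profile("the water is in the river and the rain is on the hill"),
    "ind": trigram_profile("air itu ada di sungai dan hujan ada di bukit"),
}

guess = identify("hujan di sungai", profiles)
print(guess)
```

A model of this kind can only recognize languages for which some reference text or lexicon exists, which is why even small lexicons matter for otherwise undocumented languages.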
Corpus data    ISO 639-3  CRCL Y1  CRCL Y4  Scannell (2124)¹  CRCL Y1 ∩ Scannell  UN (428)²  Relig (426)³

Austronesian   1257       109      626      267 (281)         59                  32         116
Austroasiatic  170        30       85       14 (14)           2                   7          0
Hmong-Mien     38         19       19       5 (7)             3                   3          0
Kra-Dai        95         24       48       6 (8)             5                   4          3
Sino-Tibetan   474        108      242      67 (72)           23                  27         0
Total          2034       290      1017     359 (382)         92                  73         119

1  See  the  project  /  download  page  at  http://crubadan.org.  The  corpus  base  appears  to  have  been  updated  most  recently  in 
2015.  Figures  in  parentheses  were  derived  by  counting  ISO  codes  on  the  site.  Some  of  these  have  been  retired,  but  data 
appears to have been migrated properly. The next column looks at the intersection between CRCL’s Y1 deliverables and
Scannell’s data (included in our distribution).

2  United  Nations  Declaration  of  Human  Rights  (xml  files  available  at  http://unicode.org/udhr/downloads.html) 

3  The  Watchtower  (http://jw.org)  has  links  for  671  lect-specific  pages  (with  fewer  distinct  ISO  codes);  we  have  not  finished 
identifying  ISO  codes  for  these.  eBible.org  ( http.V/ebible.  org )  links  to  545  ISO-specific  resource  sets.  It  is  likely  that  the 
Scanned  totals  incorporate  most  of  what  might  be  found  separately  from  strictly  religious  sources. 

Table  5  Text  corpus  availability  for  the  AA,  AN,  HM,  KD,  and  ST  language  families  -  coverage  is  about  17.5%. 

3b.  Metadata 

Additional  metadata  can  be  associated  with  each  word  list.  This  includes: 

•  bibliographic  source  metadata:  the  original  text,  author,  publisher,  and  other  publication  details. 

• language metadata: this includes the ISO 639-3 code and name, an (idealized) speaker location,
speaker population, and linguistic subgroup details. Aside from the ISO code and name, all of
this information is the result of an independent analysis of some sort. The most authoritative and
fully developed analyses are those of Ethnologue and Glottolog; the former is partly
open-access and partly licensed, while the latter is open-access. We provide information from
both. However, because Ethnologue GIS data may not be redistributed, we locate and supply the
nearest populated place instead.

• doculect metadata: information provided by the author to help identify the published lect; this
may include a location, the author’s (or speaker’s) name for the language, a dialect name, and
details about the informant. To the best of our ability we add details about the notation (e.g. IPA,
formal, informal) and analysis (e.g. phonemic, broad, phonetic) used for transcription. Doculect
metadata is the basis of the registration of each dataset’s DOI (digital object identifier).

We  take  different  approaches  to  providing  the  metadata:  it  may  be  cross-referenced  by  any  dataset  that 
requires  it  (e.g.  used  as  standoff  annotation),  or  some  or  all  metadata  can  be  baked  into  each  and  every 
set.  Please  let  us  know  if  a  custom  formulation  may  be  helpful.  Figure  2,  below,  shows  a  typical 
metadata  set. 

3c.  Dataset  identification  and  logical  tables 

For  various  reasons  a  single  logical  lexicon  or  collection  of  lexicons  may  be  broken  up  into  separate 
pieces  in  a  printed  work.  For  example,  in  some  short  survey  lists  each  page  contains  all  forms  for  a 
single  language  without  glossing.  For  longer  lists,  each  page  may  cover  only  a  few  words  (one  per 
column,  with  one  language  per  row),  or  many  (with  one  word  per  row,  with  languages  labeling  columns 
on  one  or  two  pages).  And,  in  some  cases,  a  single  set  of  lists  may  be  split  into  many  tables,  as  when  the 
author  is  making  a  case  for  a  proto-language  reconstruction. 

We conceive of all the lects in a given text as forming a single logical table when this perspective
benefits the user; generally, if they share essentially the same gloss list. In a logical table, lects always
label the columns, and glosses always label the rows, even if the printed work reverses this order. This
allows us to uniquely identify each lect with a bibref and column number, where the bibref is the
author’s last name, the publication year, and the first non-stop word of the title. The language name
appears in the final position for non-English publications, and in cases where a series of similar titles
would be confusing.

On occasion, a single text may contain more than one logical table, as when two sets of lects have
substantially different gloss lists, present data from different families, or differ in the content or
presentation of data. In such cases a number is added to the bibref: bibref_1, bibref_2. Column
numbering restarts with 1 in each table. Note that not all columns are necessarily transcribed or
provided as part of CRCL’s LORELEI data.

<dataset id="huffman1971vocabulary.c1">
  <metadata>
    <reference>
      <id>huffman1971vocabulary</id>
      <doi>15144/huffman1971vocabulary</doi>
      <creator>Huffman, Franklin</creator>
      <title>Unpublished vocabulary lists</title>
      <date>1971</date>
      <publisher>Huffman Papers, sealang.net/archives/huffman</publisher>
      <lects>18</lects>
    </reference>
    <language>
      <languageCode scheme="iso639-3">khm</languageCode>
      <languageName scheme="iso639-3">Central Khmer</languageName>
      <latLong source="Ethnologue18">12.4671,104.5699</latLong>
      <latLong source="Glottolog2.6">12.0515,105.015</latLong>
      <country source="Ethnologue18">Cambodia</country>
      <country source="Glottolog2.6">Cambodia</country>
      <adm level="1" source="Ethnologue18">Kampong Chhnang</adm>
      <adm level="1" source="Glottolog2.6">Kampong Cham Province</adm>
      <population source="Ethnologue18">14224500</population>
    </language>
    <doculect>
      <id>huffman1971vocabulary.c1</id>
      <doi>15144/huffman1971vocabulary.c1</doi>
      <creator>CRCL</creator>
      <date>2015</date>
      <notation>IPA</notation>
      <analysis>broad</analysis>
      <forms>887</forms>
    </doculect>
  </metadata>

Figure  2:  a  typical  metadata  set,  showing  the  bibliographic  reference,  language,  and  doculect  sections.  These  may 
be  packaged  together  with  a  dataset,  or  separately  as  part  of  a  text  and  data  bibliography. 
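
A metadata set of this shape can be read with any XML library. The following is a minimal sketch, not part of the CRCL distribution: it assumes Python's standard xml.etree.ElementTree module and an abbreviated, well-formed copy of the Figure 2 record. The function name summarize and the keys of the returned dictionary are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the Figure 2 record; element names and values follow the figure.
SAMPLE = """
<metadata>
  <reference>
    <id>huffman1971vocabulary</id>
    <creator>Huffman, Franklin</creator>
    <date>1971</date>
    <lects>18</lects>
  </reference>
  <language>
    <languageCode scheme="iso639-3">khm</languageCode>
    <latLong source="Ethnologue18">12.4671,104.5699</latLong>
    <latLong source="Glottolog2.6">12.0515,105.015</latLong>
  </language>
  <doculect>
    <id>huffman1971vocabulary.c1</id>
    <notation>IPA</notation>
    <analysis>broad</analysis>
    <forms>887</forms>
  </doculect>
</metadata>
"""

def summarize(xml_text):
    """Collect a few fields from the reference, language, and doculect sections."""
    root = ET.fromstring(xml_text.strip())
    coords = {}
    # Ethnologue and Glottolog may disagree, so keep one coordinate pair per source.
    for node in root.findall("./language/latLong"):
        lat, lon = (float(x) for x in node.text.split(","))
        coords[node.get("source")] = (lat, lon)
    return {
        "bibref": root.findtext("./reference/id"),
        "iso": root.findtext("./language/languageCode"),
        "doculect": root.findtext("./doculect/id"),
        "forms": int(root.findtext("./doculect/forms")),
        "coords": coords,
    }

summary = summarize(SAMPLE)
print(summary["doculect"], summary["iso"], summary["forms"])
```

Keeping the coordinates keyed by source makes it easy to see where the two analyses differ for the same language.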


3d.  Defective  entries 

A  raw  data  entry  may  be  excluded  from  the  distribution  set  for  various  reasons,  including: 

•  the  gloss  could  not  be  reliably  translated,  or  there  was  no  reasonable  WN  3.0  equivalent  or 
extension  available  for  the  gloss  (this  sometimes  occurs  for  phrasal  entries), 

•  the  form  could  not  be  reliably  normalized  or  analyzed  (this  sometimes  occurs  when  the  form 
includes  markup  or  typographical  errors). 

We  can  arrange  to  pass  defective  entries  through  if  desired. 

3e.  Morphological  information 

With rare exceptions, of the five language families we cover only Austronesian has active inflectional
morphology. As a rule, the datasets we provide do not regularly mark morphology. Any markup that is
provided is explicitly supplied (generally using hyphens, or occasional parenthesized affixes) in the
raw form without further information or analysis.

Some of the Sino-Tibetan data marks apparent etymological affixes. This was usually added to the
source data by the STEDT project [Matisoff 2010] in the course of their attempts at reconstruction of
proto-Sino-Tibetan. These markers are retained in the raw forms, but should not automatically be
understood to be the result of methodical morphological analysis.

In  the  non-Austronesian  families,  the  use  of  class  terms,  particles,  phonological  and  semantic 
doubling,  and  other  word-compounding  processes  provides  a  type  of  morphology.  These  will  be 
segregated  in  due  course  as  we  group  cognate  sets. 

3f.  Normalization  and  standardization  of  glossing 

Most  of  our  datasets  use  glosses  to  indicate  the  words  used  to  elicit  forms  from  native  speakers,  rather 
than  to  define  and/or  explain  known  native-language  words.  Frequently,  standardized  elicitation  lists 
are  used.  Unfortunately,  many  glosses,  standardized  or  not,  are  open  to  slight  reinterpretation  by  any 
given  linguist  or  informant.  Hence,  normalization  of  glossing  is  neither  trivial  nor  certain.  In  most 
applications,  small  differences  between  the  gloss,  and  the  item’s  “true”  semantics,  will  not  be  critical: 

•  survey  and  comparative  lists  are  used  to  elicit  central,  core,  universal  semantic  concepts;  not 
subtle  distinctions.  Hence,  the  word  is  not  likely  to  contrast  with  other  semantically  linked 
words  in  the  list;  e.g.  “stone”  as  an  object  versus  a  material,  or  “throw”  versus  “toss”  or  “fling.” 

•  part-of-speech  categories  (and  the  variation  in  English  gloss  form  they  might  require)  may  be 
determined  by  context,  particularly  in  non-Austronesian  families.  We  rely  on  conventional 
choices,  e.g.  “blue”  and  “heavy”  are  adjectives. 

•  despite  subtle  differences  from  the  raw  gloss,  the  normalized  gloss  reliably  aligns  with 
etymologically  related  items  in  other  word  lists,  and  is  able  to  support  downstream  applications 
for  cognate  identification,  distance  measurement,  lexicon  extension,  phonological  modeling,  and 
so  on. 

We  normalize  to  WordNet  3.0  senses,  because  it  is  a  mature,  well-developed,  and  widely  used 
resource,  replete  with  analytical  tools,  and  linked  to  many  other  lexical  resources.  Hierarchical 
relations,  well-defined  sense  definitions,  and  corpus-based  sense  counts  also  help  make  WN  its  own 
disambiguation  tool.  Nevertheless,  WordNet  has  gaps.  It  does  not  define  closed-set  vocabulary  items, 
nor  does  it  recognize  the  regular  patterning  of  some  lexical  items  (in  particular,  kin  terms)  that  figure 
heavily  in  comparative  and  survey  wordlists. 

Unavoidably,  there  are  also  differences  in  the  way  English  and  other  languages  lexicalize  concepts, 
actions,  or  things;  e.g.  “hand/arm”  and  “blue/green”  are  indivisible  lexical  items  in  much  of  Asia- 
Pacific.  And,  in  some  cases,  we  are  not  sure  whether  or  not  a  lexical  gap  exists.  For  example,  “big 
basket”  might  be  a  noun  with  modifier,  a  single  lexical  item  distinct  from  a  small  basket,  or  just  the 
standard  word  used  for  baskets  (i.e.  the  elicitation  list  might  request  “big  basket”  and  “small  basket”  and 
receive  the  same  form  for  both). 

Our  MetaGloss  system  addresses  these  issues. 

• when possible, a single WordNet 3.0 sense is provided: house#n#1

• when two or more useful interpretations are plausible, they are pipe-separated:
bake#v#1|toast#v#1.

• several word classes have been added (with all items numbered #1): d(emonstrative),
j (conjunction), k(in term), m(odal), p(ronoun), q (interrogative), x (temporarily uncategorized).

• when new senses are added to the WN a, n, r, v lists, they are numbered #0: armspan#n#0.

• a polysemous sense that does not exist in English is indicated by labeling the WN 3.0 sense:
v@fist#n#1 indicates the verb sense of the noun “fist,” i.e. “make a fist.”

• kin terms are built up in regular fashion, starting with the person who is ultimately referenced:
mot.fat#k#1 is the mother of the father, or the paternal grandmother.

• senses may have attributes that help document what we believe is the useful reference meaning;
e.g. carry#v#1:tumpline. This indicates that for purposes of cognate grouping the item clusters
with “carry” terms, but keeps “tumpline” accessible. These head+attribute forms may be
simplified in the future.

• classifiers are noted by the :clf attribute, e.g. basket#n#1:clf is a classifier for baskets,
several#a#1:clf for several items, kick#v#1:clf is an instance of kicking. There may be some
inconsistency in the listing of feature-oriented classifiers (e.g. long, thin items) because it is not
always clear if the given form is a classifier, or just an instance of an item.

All senses used in any distribution may be found in the top-level metagloss/ directory.
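
The notation above is regular enough to parse mechanically. The sketch below is illustrative only and is not CRCL's MetaGloss code. The grammar it assumes (pipe-separated alternatives, an optional class@ prefix, lemma#class#number, and an optional :attribute suffix) is inferred from the examples in this section, and the names Sense and parse_metagloss are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional

class Sense(NamedTuple):
    lemma: str                   # e.g. "house", or a kin chain such as "mot.fat"
    word_class: str              # WN classes a, n, r, v, or an added class (d, j, k, m, p, q, x)
    number: int                  # WN 3.0 sense number; 0 marks a newly added sense
    derived_class: Optional[str]  # "v" in v@fist#n#1
    attribute: Optional[str]     # "tumpline" or "clf"

# Assumed grammar: [class@]lemma#class#number[:attribute]
_SENSE = re.compile(
    r"^(?:(?P<derived>[a-z])@)?"
    r"(?P<lemma>[^#|@:]+)#(?P<cls>[a-z])#(?P<num>\d+)"
    r"(?::(?P<attr>.+))?$"
)

def parse_metagloss(text: str) -> List[Sense]:
    """Return one Sense per pipe-separated alternative."""
    senses = []
    for part in text.split("|"):
        match = _SENSE.match(part.strip())
        if match is None:
            raise ValueError(f"unrecognized metagloss: {part!r}")
        senses.append(Sense(
            lemma=match["lemma"],
            word_class=match["cls"],
            number=int(match["num"]),
            derived_class=match["derived"],
            attribute=match["attr"],
        ))
    return senses

for gloss in ("house#n#1", "bake#v#1|toast#v#1", "v@fist#n#1",
              "mot.fat#k#1", "carry#v#1:tumpline", "armspan#n#0"):
    print(gloss, parse_metagloss(gloss))
```

For cognate grouping, a consumer would typically key on lemma, word_class, and number, and treat the attribute as supplementary information.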

3g.  Normalization  and  analysis  of  forms 

There  is  an  enormous  amount  of  variation  in  the  way  that  phonological  forms  -  even  for  the  same  items 
-  are  transcribed  in  the  source  data.  This  is  due  to  differences  in: 

• analysis: a phonetic transcription most closely follows actual utterances. An analyzed phonemic
transcription ignores allophonic variation and produces somewhat idealized forms. A broad
phonemic transcription ignores obvious minor variations, but does not guarantee a minimal
phoneme set. It is not always possible to ascertain which analysis a transcription relies on.

• notation: an IPA transcription follows the formal IPA guidelines (and directly maps to Unicode
glyphs), with some rare exceptions and national variants. A formal transcription may pre-date
modern IPA practice; it can usually be mapped to modern IPA. An ad hoc informal transcription
typically uses the roman alphabet, but does not always follow any recognized conventions.

• tradition: the IPA provides notation, but does not define its usage. Some linguists will suppress
features they feel are predictable within the language, while others mark them explicitly. It is not
always possible to determine which path has been followed.

CRCL’s  MetaForm  normalization  has  a  dual  goal: 

• to make data comparable, despite having been originally prepared using different analyses,
notations, and traditions,

• to add an explicit analysis, often based on our knowledge of the individual language, that will
benefit downstream applications such as cognate alignment, language distance measurement, and
audio segmentation.

We  accomplish  this  dual  goal  by: 

•  normalization:  translation  into  appropriate  IPA  notation, 

•  syllabification:  marking  of  syllable  boundaries,  which  is  often  needed  for  proper  segmentation, 

•  sub-syllabification:  marking  of  onset,  nucleus,  and  coda  syllable  segments, 

•  segmentation:  division  into  individual  phonological  segments  -  logical  single-character  entities 
that  cannot  always  be  represented  in  IPA  /  Unicode, 

•  feature  analysis:  specification  of  the  phonological  features  of  each  segment,  and 

•  role  analysis:  specification  of  the  position  /  phonotactic  role  of  each  segment. 

For example, the imaginary raw form /mboa/ may actually vary in length from one (/ᵐboa/) to three
syllables. The leading /m/ might be prevocalized, unvocalized (/ᵐ/), or vocalized,
according to implied phonotactic restrictions. Similarly the language might allow or forbid
diphthongs. MetaForm makes any analysis we are able to provide explicit.

Four characters - ɿ ʅ ʮ ʯ - that are not strictly IPA (but which could be replaced by IPA
sequences) are retained because they are widely used in the region’s modern notation. In effect, they fill
gaps that, arguably, the IPA could have provided. One additional character - /v/ - is used as the high,
back, rounded, fricated vowel. It appears variously in the literature as /v/ with an over/under diacritic,
and there is no formal (or ideal, albeit informal) IPA alternative.

Syllable boundaries cannot always be determined. In some cases linguists disagree, and in others we
do not have the information required to recognize that, for example, a /-tt-/ sequence should be a
geminate /-tː-/ rather than /-t t-/. To help minimize the consequences of an incorrect choice, we provide
all items both in fully tokenized form, and in a simpler rendering as phonological segments. From an
earlier example:

<form status="silver" style="tokenized"> : . h . : u . . . ! : . n . : | 55 : . th . : | i : . . . ] [ 14</form>

<form status="silver" style="segmented">h.u.n.55 th.i:.14</form>

The tokenized form is easily rendered as sub-syllabic ngrams, while the segmented form is trivially
converted into ngrams of phonological segments or features.
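
As a rough illustration of how a segmented form might be consumed, the sketch below splits one into syllables, segments, and tones. It is not CRCL's reader: it assumes that syllables are separated by whitespace, that segments are separated by periods, and that a trailing all-digit token is a tone contour. The example string and the function names are hypothetical.

```python
def parse_segmented(form):
    """Split a segmented form into syllables, each with its segments and tone."""
    syllables = []
    for chunk in form.split():                       # assumption: whitespace separates syllables
        tokens = [t for t in chunk.split(".") if t]  # assumption: periods separate segments
        # Assumption: a final all-digit token is a tone contour, not a segment.
        tone = tokens.pop() if tokens and tokens[-1].isdigit() else None
        syllables.append({"segments": tokens, "tone": tone})
    return syllables

def flat_segments(form):
    """Return the segments of the whole word, ignoring syllable breaks and tones."""
    return [seg for syl in parse_segmented(form) for seg in syl["segments"]]

example = "h.u.n.55 th.i:.14"   # hypothetical two-syllable form
for syllable in parse_segmented(example):
    print(syllable)
print(flat_segments(example))
```

The flat list is the natural input for segment ngrams, while the per-syllable view keeps tone attached to the syllable it belongs to.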

3h.  Feature  analysis 

CRCL’s feature analysis is shown in the Appendix, and partly summarized below. This table drives all
feature assignments, and is designed for clarity in tagging tokens, and convenience in downstream
applications. It does not account for all possible linguistic behavior worldwide, but intentionally
limiting its scope to features characteristic of our five language families of interest helps reveal errors in
data input or analysis: such errors require impossible tokenization or feature assignments. All
token-to-feature assignments are unambiguous and reversible. Note that some phonotactic information
(e.g. role and position) is built in.


Category           Attributes

class              consonant, vowel, syllabic, minor
role               onset, nucleus, coda
position           core, post
length             epenthetic, short, long
pre-articulation   prenasalized, devoiced, preglottalized, preaspirated, prelabialized, prestopped
height             high, near-high, close-mid, mid, open-mid, near-low, low
backness           front, near-front, central, near-back, back
place              bilabial, labiodental, dental, alveolar, retroflex, palatoalveolar, alveolopalatal, palatal,
                   labiopalatal, velar, labiovelar, uvular, pharyngeal, glottal
manner             nasal, stop, implosive, affricate, fricative, approximant, tap-flap, trill
realization        rounded, voiced, retroflexed, lateralized, fricated, nonvocalized, prevocalized, vocalized
phonation          nasal, aspirated, devoiced, breathy, creaky, dental, raised, lowered, rhotic
post-articulation  nasalized, glottalized, palatalized, labialized, labiopalatalized, stopped, velarized,
                   pharyngealized

Table  6  Main  features  of  CRCL’s  phonological  feature  analysis.  This  is  provided  in  full  in  the  Appendix. 
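
The requirement that token-to-feature assignments be unambiguous and reversible can be checked mechanically. The sketch below is an assumption-laden illustration, not the Appendix table: the five tokens and their feature bundles are a hypothetical fragment that only borrows attribute names from Table 6, and invert and token_for are invented names.

```python
# Hypothetical fragment of a token-to-feature table (attribute names from Table 6).
FEATURES = {
    "p": ("consonant", "bilabial", "stop"),
    "b": ("consonant", "bilabial", "stop", "voiced"),
    "m": ("consonant", "bilabial", "nasal", "voiced"),
    "i": ("vowel", "high", "front"),
    "u": ("vowel", "high", "back", "rounded"),
}

def invert(table):
    """Build the feature-to-token map, failing if two tokens share a feature bundle."""
    inverse = {}
    for token, feats in table.items():
        key = frozenset(feats)
        if key in inverse:
            raise ValueError(f"{token!r} and {inverse[key]!r} share one feature bundle")
        inverse[key] = token
    return inverse

INVERSE = invert(FEATURES)

def token_for(features):
    """Look up the token for a feature bundle; the order of features is irrelevant."""
    return INVERSE[frozenset(features)]

print(token_for(["voiced", "stop", "bilabial", "consonant"]))
```

A feature bundle that matches no token raises a KeyError, which is one way that an impossible assignment in the input data would surface.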


The class attributes syllabic and minor, and their associated realization features nonvocalized,
prevocalized, and vocalized, are specifically intended to address the problem of inconsistent notation of
unstressed onset syllables (sesquisyllables) widely found throughout the region, e.g. /kka/, /ka ka/,
/k ka/, /k.ka/. As a rule, when onsets clearly violate the sonority sequence principle, we treat
them as minor syllables, without overt vowels, whose vocalization might or might not be inferable from
our knowledge of the language and/or the author’s transcription practice.

This  has  a  number  of  advantages,  not  the  least  of  which  is  simplifying  automated  cognate  segment 
alignment  and  distance  measurement.  One  consequence  -  which  we  accept,  because  it  is  characteristic 
of  all  families  that  we  work  with  -  is  that  complex  onsets  that  violate  sonority  are  not  seen.  We  accept 
this  with  the  understanding  that  this  analysis  may  be  extended  in  other  areas  of  the  world. 
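
The minor-syllable rule can be sketched as a sonority check on the onset. The code below is a simplified illustration under stated assumptions, not CRCL's analysis: the five-step sonority scale and the segment inventory are hypothetical, a sonority plateau (as in /kk/) is counted as a violation, and split_onset is an invented name.

```python
# Assumed sonority scale: stops < fricatives < nasals < liquids < glides.
SONORITY = {
    **dict.fromkeys("ptkbdgʔ", 1),
    **dict.fromkeys("fvszh", 2),
    **dict.fromkeys("mnŋ", 3),
    **dict.fromkeys("lr", 4),
    **dict.fromkeys("wj", 5),
}

def split_onset(onset):
    """Split an onset into (minor-syllable consonants, main-syllable onset).

    Sonority should rise toward the nucleus. Everything before the last point
    where it fails to rise is assigned to minor syllables.
    """
    cut = 0
    for i in range(len(onset) - 1):
        if SONORITY[onset[i]] >= SONORITY[onset[i + 1]]:
            cut = i + 1
    return onset[:cut], onset[cut:]

print(split_onset(["k", "k"]))   # plateau: the first /k/ becomes a minor syllable
print(split_onset(["m", "b"]))   # falling sonority: /m/ becomes a minor syllable
print(split_onset(["p", "l"]))   # rising sonority: an ordinary complex onset
```

Under this rule no surviving main-syllable onset violates sonority, which is the consequence noted above.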

3i.  Phonodynamic  inventories  and  ngrams 

Phonodynamic  analysis  datasets  supply  lect-by-lect  surveys  of  phonological  segments,  their  positions 
within  syllables  and  words,  and  various  statistical  measures.  They  allow  the  inference  of  phonotactic 
restrictions  on  (or  preferences  for)  segment  collocations.  However,  it  is  important  to  understand  that 
these  are  purely  data-driven.  They  should  inform,  rather  than  substitute  for,  a  formal  analysis. 

We supply two basic phonodynamic dataset types: one of tokens, and one of features. For the
moment, they are both in TSV (not XML) form. The token survey below (a similar table laid out by rows
is also provided) shows:

• counts for sub-syllable tokens: the complete nucleus, onset (onCC), coda (codCC), and tone
contour,

•  counts  for  individual  segments,  by  position  (for  consonants)  or  value  (for  vowels), 

•  summary  counts  of  each  syllable  canon. 

Figure 3 Counts from sketch-cols.tsv. This provides a quick overview of phonological and sub-syllabic segments.

The  second  basic  type  provides  a  segment-by-segment  feature  inventory,  also  with  positional  counts. 

•  counts  for  each  token,  by  position:  1-4  for  vowels,  or  onset,  coda,  or  minor  syllable  onset  or 
coda, 

• a tabulation of each segment’s phonological features: length, pre-articulation (e.g.
pre-nasalization), height, back, place, manner, realization (e.g. rounding, voicing), phonation
(aspiration, creak, etc.), and post-articulation (e.g. palatalized or glottalized).

•  summary  counts  of  all  n-thongs,  onsets,  codas,  tones,  and  syllable  canons  are  also  provided. 

Figure 4 Counts from sketch-features.tsv. This provides an overview of segment features by position, and multi-segment
onset, nucleus, and coda sections.

Many  statistical  measures  of  feature  significance  are  calculated.  Because  these  are  based  on  simple 
calculations  using  unweighted  samples,  they  must  be  viewed  as  extremely  rough  indicators.  They 
include: 

•  diphone/triphone  frequency  vectors:  their  orthographic  equivalents  are  very  effective  for  text 
language  identification;  it  is  not  clear  if  wordlist  distributions  are  enough  to  characterize 
language  similarity.  We  generate  these  for  both  segments  and  specific  features  (e.g.  consonant 
place  and  vowel  back  collocations). 

•  functional  load:  a  measure  of  the  segment’s  information  content;  how  necessary  is  it  to 
uniquely  identify  its  context?  We  calculate  this  as  the  segment’s  number  of  contrastive  /  total 
appearances;  i.e.  the  number  of  times  that  the  segment  must  be  known  to  disambiguate  a  lexeme 
divided  by  its  total  appearance  count.  (See  also  [Surendran  2003,  2006].) 

• salience: the equivalent of inverse document frequency [Sparck-Jones 1972]; how well does a
particular segment or collocation identify a language? By treating each language’s list of
segments as a document, we can define each document collection as the set of languages within a
given geographical (i.e. n00-mile radius) or etymological (e.g. sub-branch sisters) distance from
the target language. Thus, salient segments may provide geographic shibboleths, or evidence of
shared etymological innovation or loans.

• neighborhood and clustering coefficient: how closely linked (i.e. varying from one another by
a single feature or segment) are the words in a language, and what is each word’s phonological
neighborhood? [Vitevitch 2007, Luce 1998] Because we expect sound changes to be regular,
we expect neighborhoods to be recognizable even if surface forms vary. Thus, this data can
serve as a proxy for language divergence.

•  wordlikeness:  how  well  does  a  word  reflect  both  the  phonological  distributions  and  phonotactic 
constraints  of  a  given  language? 
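
Two of the measures above can be made concrete with a short sketch. This is one possible reading of the definitions, not CRCL's implementation. It assumes that an occurrence of a segment is contrastive when some other word in the list differs from it only at that position, and that salience is the plain logarithmic inverse document frequency over a given set of language inventories. The toy data and the function names are hypothetical.

```python
import math
from collections import Counter

def functional_load(lexicon):
    """Contrastive appearances divided by total appearances, per segment."""
    words = {tuple(w) for w in lexicon}
    # A frame is a word with one position masked; frames shared by two or
    # more words mark positions where the segment distinguishes lexemes.
    frames = Counter()
    for w in words:
        for i in range(len(w)):
            frames[(w[:i], w[i + 1:])] += 1
    total, contrastive = Counter(), Counter()
    for w in words:
        for i, seg in enumerate(w):
            total[seg] += 1
            if frames[(w[:i], w[i + 1:])] > 1:
                contrastive[seg] += 1
    return {seg: contrastive[seg] / total[seg] for seg in total}

def salience(inventories):
    """Inverse document frequency of each segment across language inventories."""
    n = len(inventories)
    df = Counter(seg for inv in inventories.values() for seg in set(inv))
    return {seg: math.log(n / count) for seg, count in df.items()}

load = functional_load(["pat", "bat", "pad"])       # toy lexicon, one character per segment
sal = salience({"L1": ["p", "b"], "L2": ["p"]})     # toy inventories
print(load)
print(sal)
```

In the toy lexicon the vowel never distinguishes two words, so its load is zero, while a segment found in every inventory has zero salience.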

We  have  extracted  a  series  of  unigram  and  ngram  sets  from  the  data,  by  lect.  These  include: 

• phonological segment bi- and trigrams: implicit blanks before and after each word are treated as
segments. (2_segment.tsv, 3_segment.tsv)

• segment(s) plus nucleus bi- and trigrams: these treat the nucleus as a single phonological
segment. (2_segment_nuc.tsv, 3_segment_nuc.tsv)

• sub-syllabic (onset / nucleus / coda and coda / onset) bi- and trigrams: again, implicit pre- and
post-syllable blanks are treated as tokens. (2_token.tsv, 3_token.tsv)

• onset or nucleus plus tone collocations: these are only calculated for tone languages.
(2_onset_tone.tsv, 2_nucleus_tone.tsv)

• feature trigrams: these separately track (consonant) place and (vowel) backness, and (consonant)
manner and (vowel) height. (3_place_back.tsv, 3_manner_height.tsv)

• functional load, by phonological segment: these count appearances and contrasts, and calculate
load. (load.tsv)

Other  ngrams  can  be  extracted  on  request. 
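
For readers who want to derive further ngram sets themselves, the following sketch counts segment ngrams with implicit word-boundary blanks, in the spirit of the 2_segment.tsv and 3_segment.tsv sets. It makes no claim about how those files were produced: the "#" boundary symbol, the function name, and the toy word list are assumptions.

```python
from collections import Counter

BLANK = "#"   # assumed symbol for the implicit blank before and after each word

def segment_ngrams(words, n):
    """Count n-grams of phonological segments, treating word boundaries as segments."""
    counts = Counter()
    for segments in words:
        padded = [BLANK] + list(segments) + [BLANK]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts

words = [["h", "u", "n"], ["th", "i:"]]   # toy word list; each word is a list of segments
bigrams = segment_ngrams(words, 2)
trigrams = segment_ngrams(words, 3)
for gram, count in sorted(bigrams.items()):
    print("\t".join(gram), count, sep="\t")
```

Because each word is a list of segments rather than a string, multi-character segments such as /th/ or /i:/ are counted as single units.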

3j.  Lexical  analytics:  contrast,  cover,  neighbor,  wordlikeness 

Lexical analytics describe the relationship between forms, and between forms and the full lexicon. We
have extracted min contrast and min cover sets for each doculect:

• minimal contrast sets are items that differ by single phonological segment pairs, and are useful
for establishing formal phonemic analyses; i.e. recognizing allophonic variation. We list these
by segment pair, including the null segment (e.g. ball, all). (contrast.txt)

• minimum cover sets are lists of words that, together, include all segments. These are not unique;
more than one possible list may include all segments. This is a computationally expensive
operation; we employ a greedy algorithm that is almost certain to return the shortest possible list.
(cover.txt)

• neighbor sets treat each word as the central node in a graph; each edge represents a distance of
one phonological segment. We calculate the neighborhood density, number of edges, and
clustering coefficient (number of links between the neighbors). (density.tsv)

• wordlikeness indicates how well a word matches the phonological distributions and phonotactic
restrictions of the lexicon as a whole. Although more typically used to evaluate pseudowords,
this measure can assist language identification. (wordlike.tsv)
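
The first three analytics can be sketched in a few functions. This is an illustration under assumptions, not the code behind contrast.txt, cover.txt, or density.tsv. Words are tuples of segments, minimal pairs are limited to substitutions (the null-segment case is left out for brevity), the clustering coefficient is reported as the fraction of possible links between neighbors, and all function names are hypothetical. A greedy cover of this kind is an approximation and is not guaranteed to be the shortest list.

```python
from itertools import combinations

def one_apart(a, b):
    """True if a and b differ by one substituted, inserted, or deleted segment."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]   # substitution at position i
    return a[i:] == b[i + 1:]           # b has one extra segment at position i

def minimal_pairs(words):
    """Group same-length word pairs by the single segment pair that separates them."""
    pairs = {}
    for a, b in combinations(words, 2):
        if len(a) == len(b) and one_apart(a, b):
            i = next(k for k in range(len(a)) if a[k] != b[k])
            pairs.setdefault(frozenset((a[i], b[i])), []).append((a, b))
    return pairs

def neighbors(words):
    """Graph in which each edge joins two words that are one segment apart."""
    graph = {w: set() for w in words}
    for a, b in combinations(words, 2):
        if one_apart(a, b):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def clustering_coefficient(graph, word):
    """Fraction of possible links among a word's neighbors that actually exist."""
    nbrs = graph[word]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
    return links / (k * (k - 1) / 2)

def greedy_cover(words):
    """Repeatedly pick the word that adds the most not-yet-covered segments."""
    remaining = set().union(*(set(w) for w in words))
    cover = []
    while remaining:
        best = max(words, key=lambda w: len(remaining & set(w)))
        cover.append(best)
        remaining -= set(best)
    return cover

words = [tuple("pat"), tuple("bat"), tuple("pad"), tuple("at")]   # toy lexicon
graph = neighbors(words)
print(len(graph[tuple("pat")]), clustering_coefficient(graph, tuple("pat")))
print(minimal_pairs(words))
print(greedy_cover(words))
```

The pairwise comparison is quadratic in the number of words, which is acceptable for survey-length word lists but would need indexing for full lexicons.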

3k.  Related  text  data 

When available, we have included corresponding data from Scannell’s An Crúbadán project:

• trigram grapheme lists, including implicit onset and follower spaces,

• monogram and bigram wordlists,

• source URLs (Scannell does not release the original texts, but provides the links needed to scrape
them).

These  sets  have  several  applications: 

•  language  subgrouping:  distance  measures  between  ngrams  (e.g.  cosine  distance)  can  be  used  to 
generate  trees  of  language  relations. 

• ortho-to-phono and vice versa: the phonological sets can help build conversion tools when used
in conjunction with orthographic ngrams. Among other applications, these will help answer the
question of just how well the lexicon reflects the language as seen in a text corpus.

•  language  identification:  it  is  an  open  question  whether  ngrams  encapsulate  the  same  kind  of 
phonotactic  information  that  humans  rely  on  for  rapid  language  identification. 
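
As a minimal sketch of the distance measure mentioned under language subgrouping, the code below builds grapheme trigram counts with implicit leading and trailing spaces and compares two profiles by cosine distance. It is illustrative only: the function names and sample strings are hypothetical, and the tree-building step that would consume the resulting distance matrix is not shown.

```python
import math
from collections import Counter

def char_trigrams(text):
    """Grapheme trigram counts, with one implicit space before and after the text."""
    padded = " " + " ".join(text.split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine_distance(a, b):
    """1 minus the cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

profile_a = char_trigrams("ban ban")   # invented sample strings
profile_b = char_trigrams("bang ban")
print(cosine_distance(profile_a, profile_a))
print(cosine_distance(profile_a, profile_b))
```

The same distance function applies unchanged to the phonological ngram counts, since both are sparse count vectors.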

We very much want to extend available text data beyond those sets trivially identified by BCP-47 style
script codes, or found in Wikipedia pages; see Web corpus acquisition in the Applications section.

3l. Cognate sets

Cognate sets are provided as standalone XML entries (figure 5). All cognate relations are tabulated in
cognates/grid.tsv, which is essentially a table whose rows are ISO 639-3 codes, and whose columns are
rough historical glosses, given as WordNet senses. Sets of corresponding items from two or more
languages are suitable as training data for applications like inference of regular sound change
correspondences, and lexicon extension.

<cognate id="huffman1971vocabulary:C:c13.r625.gs2041.i8527" iso639-3="lbo"
         lang="Laven">
  <etygloss>roast#v#1</etygloss>
  <etyset>AA:S2041</etyset>
  <form>buh</form>
</cognate>

Figure 5 A typical cognate entry. The id provides a unique link to a data item. Language-related details are baked in for
convenience, and can be extended if desirable.

The <etygloss> element provides a nominal index term for all of the cognate clusters with the same
rough semantics. This is a term of convenience, and might not actually reflect the meaning of the
proto-form. The <etyset> element identifies the proto-form’s nominal family source (here, Austroasiatic),
and numbers the cognate cluster. When possible, the number refers to an established cognate set from
the literature. Here, S2041 refers to Shorto’s set 2041. Our current reference set includes:

•  AA  Austroasiatic  [Shorto  2006] 

•  AN  Austronesian  [Blust  2010,  Wolff  2010,  Greenhill  2008] 

•  HM  Hmong-Mien  [Ratliff  2010] 

•  KD  Kra-Dai  [Hudak  2008,  Pittayaporn  2009,  Weera  2000,  Norquest  2007] 

•  ST  Sino-Tibetan  [Matisoff  2010] 

Many  cognate  sets  also  have  ad  hoc  identification  numbers  (e.g.  AA:4).  Items  in  these  sets  form  a 
coherent  group  that  is  either  not  reported  in  the  literature  (which  is  hardly  exhaustive),  or  which  will 
probably  be  moved  to  a  different  etygloss  set.  We  derive  cognate  sets  in  the  following  manner: 

•  calculate  the  surface  similarity  between  all  forms  with  closely  related  semantics.  We  use 
Kondrak-style  phonological  similarity,  which  is  robust  in  the  face  of  feature  (vs.  IPA  character) 
variation  [Kondrak  2002], 

•  use  different  clustering  algorithms  (bottom-up  agglomeration,  and  Markov  chain  clustering  [van 
Dongen  2000])  to  form  likely  cognate  groups.  It  is  difficult  to  predict  what  algorithm  and 
parameters  will  create  the  most  realistic  clusters;  we  pre-calculate  a  half-dozen  trial  settings,  then 
choose  a  starting  set, 

•  individually  revise  the  automatically  generated  groups,  adding  references  to  sets  established  in 
the  literature  when  possible. 
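
The first two steps can be illustrated with a deliberately simplified sketch. It is not the procedure used for the deliverables: normalized edit distance over characters stands in for Kondrak-style phonological similarity, only average-linkage bottom-up agglomeration is shown (no Markov clustering), and the threshold of 0.6 is an arbitrary assumption. Apart from buh (from Figure 5), the forms are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of segments."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def distance(a, b):
    """Edit distance scaled to the range 0..1 (a stand-in for phonological similarity)."""
    return edit_distance(a, b) / max(len(a), len(b), 1)

def agglomerate(forms, threshold=0.6):
    """Average-linkage bottom-up clustering; stop when the closest pair exceeds threshold."""
    clusters = [[f] for f in forms]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(distance(a, b) for a in clusters[i] for b in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        clusters[i] += clusters.pop(j)   # j > i, so index i is still valid after the pop
    return clusters

forms = ["buh", "puh", "buk", "tana", "tano"]   # forms sharing one rough gloss (mostly invented)
clusters = agglomerate(forms)
print(clusters)
```

Running the same forms with several thresholds gives a quick view of how sensitive the candidate groups are to the parameter, which is the reason several trial settings are compared before manual revision.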

Many  cognate  sets  will  be  relatively  small  at  first.  We  may  not  yet  have  data  from  other  languages  in 
the  same  etymological  subgroup,  might  not  have  established  enough  clusters  to  support  claims  regarding 
more  dramatic  phonological  changes,  and/or  have  not  yet  established  a  large  enough  number  of  sets  to 
reliably  merge  groups  that  require  an  argument  for  semantic  shift. 

Formal  cognate  relations  are  not  always  needed  to  compare  wordlists  from  sister  languages  that  are 
known  to  be  etymologically  close,  particularly  if  they  have  been  elicited  using  the  same  glosses. 
Anybody  can  perform  the  same  item-by-item  distance  measure,  using  their  own  cutoff  rule  of  thumb  for 
assumed  cognate  status.  However,  this  simple  approach  becomes  progressively  less  reliable  as  the 
distance  between  languages  increases,  or  as  individual  linguists’  practice  in  data  collection  varies. 
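The item-by-item comparison described above can be sketched as follows. This is only an illustration: `difflib.SequenceMatcher` is a generic string matcher used here in place of a phonological distance measure, and the cutoff, function name, and sample wordlists are hypothetical.

```python
from difflib import SequenceMatcher


def assumed_cognate_rate(list_a: dict[str, str],
                         list_b: dict[str, str],
                         cutoff: float = 0.7) -> float:
    """Share of commonly glossed items whose forms are similar enough to be
    treated as cognate under a rule-of-thumb cutoff."""
    shared = sorted(set(list_a) & set(list_b))
    if not shared:
        return 0.0
    hits = sum(
        1 for gloss in shared
        if SequenceMatcher(None, list_a[gloss], list_b[gloss]).ratio() >= cutoff
    )
    return hits / len(shared)


# Two hypothetical wordlists elicited with the same glosses.
lect_a = {"eye": "mata", "skin": "kulit", "water": "air"}
lect_b = {"eye": "mate", "skin": "kuli", "water": "danum", "fire": "apuy"}
rate = assumed_cognate_rate(lect_a, lect_b)
```

The result depends entirely on the chosen cutoff and on consistent glossing, which is why this shortcut degrades as languages diverge or as elicitation practice varies.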

Finally, we mention in passing that formal Swadesh lists are not intended to elicit cognates, but rather
to expose the rate of cognate replacement. Nevertheless, some comparative surveys may use Swadesh or
similar elicitation terms to seek cognates only. Each approach addresses different goals; our point is
simply that one should avoid making assumptions about list content and utility.

3m.  HA/DR  thesaurus 

The Ariel project’s HA/DR Topic Lexicon lists roughly 34,000 terms “relevant to the HA/DR topic
taxonomy devised by DARPA and the LORELEI evaluation team.” We have extracted a thesaurus of
terms that appear both in this list and as CRCL metaglosses.

We have further extended the HA/DR list by 200+ terms that occur in our wordlists and appear to
be relevant, including kill, poison, nauseous, afraid, fear, grave, blood, bury, hungry, thirsty, etc. These
all have high negative scores in the SentiWordNet, SentiWord, and/or Valence, Arousal, Dominance
analyses [Gatti 2013, Baccianella 2010, Warriner 2013]. We think these terms are more likely to be
relevant in monitoring informal communications such as Twitter.
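The two-step construction described in this section (intersection with the topic lexicon, then extension by negative sentiment) can be sketched as below. All names, the score scale, the cutoff, and the sample data are hypothetical; real sentiment resources each use their own scales, and the project's actual selection was reviewed by hand.

```python
def headword(metagloss: str) -> str:
    """Return the lemma part of a simple MetaGloss label such as 'kill#v#1'."""
    return metagloss.split("#", 1)[0]


def build_thesaurus(hadr_terms, metaglosses, negativity, cutoff=0.5):
    """Split metaglosses into (core, extension):
    core      - headword also appears in the HA/DR topic lexicon
    extension - headword is absent from the lexicon but scores as strongly negative
    """
    hadr = {term.lower() for term in hadr_terms}
    core, extension = {}, {}
    for gloss in metaglosses:
        head = headword(gloss).lower()
        if head in hadr:
            core.setdefault(head, []).append(gloss)
        elif negativity.get(head, 0.0) >= cutoff:
            extension.setdefault(head, []).append(gloss)
    return core, extension


# Hypothetical inputs; negativity is an assumed score in [0, 1].
core, extension = build_thesaurus(
    hadr_terms=["flood", "earthquake", "shelter"],
    metaglosses=["flood#n#1", "flood#v#1", "hungry#a#1", "house#n#1"],
    negativity={"hungry": 0.8, "house": 0.1},
)
```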
4. Results and Discussion

Overview 

Datasets  provided  for  the  final  milestone  are  summarized  in  Figure  6,  and  go  well  beyond  the  contract 
requirements. 

Overview of LORELEI data

Language family summary

Family   ISOs   Sets   Cogs    Forms   Total ISO   Coverage
AA         30     50    209    42633         170        17%
AN        335    680    438   550142        1257        26%
HM         19     34    458    14449          38        50%
KD         23     54    249    71548          95        24%
S                                             14         0%
ST        108    306    375   194616         460        23%
Total     515   1124    n/a   873388        2034        25%

ISO counts are unique within each family (not source).

Cogs gives EtySet (concept) counts; each usually contains several distinct cognate groups.


Source summary

Family  Source collection*          ISOs  Sets   Forms  Avg**  Gloss status  Form status  Notation  Analysis
AA      huffman1971vocabulary         16    18   11997    666  silver        silver       IPA       broad
AA      huffman1979vocabulary          7    11   15481   1407  silver        silver       IPA       phonemic
AA      theraphan2001languages_1      10    14    9420    672  silver        silver       IPA       phonemic
AA      theraphan2001languages_2       6     7    5735    819  silver        silver       IPA       phonemic
AN      arnaud1997lexique             34    36   33186    921  silver        silver       formal    broad
AN      reid1971philippine            40    43   17359    403  silver        silver       IPA       phonemic
AN      reid2016philippine            49    79   33622    425  silver        silver       IPA       phonemic
AN      stokhof1980holle             153   280  244215    872  silver        silver       adhoc     narrow
AN      tadmor2015jakarta             15    52   85925   1652  silver        silver       IPA       phonemic
AN      tadmor2015languages           12    30   15024    500  silver        silver       IPA       phonemic
AN      tryon1995comparative          80    80   90438   1130  silver        silver       formal    broad
AN      yap1977comparative            73    80   30373    379  silver        silver       IPA       phonemic
HM      ratliff2010language           11    11    4782    434  silver        silver       IPA       phonemic
HM      wang1995miao                  18    23    9667    420  silver        silver       formal    phonemic
KD      hudak2008comparative          12    18   14163    786  silver        silver       formal    phonemic
KD      zhang1999zhuang               14    36   57385   1594  silver        silver       IPA       phonemic
ST      huang1992tbl                  45    49   82632   1686  silver        silver       formal    broad
ST      lsm2015chin                   23   142   59566    419  silver        silver       IPA       narrow
ST      lsm2015naga                   14    81   29081    359  silver        silver       IPA       narrow
ST      marrison1967classification    28    34   23337    686  silver        silver       adhoc     narrow
        20 source files              660  1124  873388    777

ISO counts are unique within each source (not family). Minimum count for inclusion is 100 items.
* Some source collections may contain multiple bibliographic sources (bibrefs).

** Roughly equals the number of distinct glosses per elicitation set


Figure  6  Overview  of  final  deliverable  set.  As  noted  earlier,  both  glosses  and  forms  are  gold-standard  in  all  but  name  -  we 
feel  that  a  formal  roll-out,  and  comment  period  in  the  linguistics  community,  is  appropriate. 
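The derived columns of Figure 6 can be checked with simple integer arithmetic. The sketch below assumes that Coverage is ISOs divided by Total ISO, and that Avg is Forms divided by Sets, both truncated to whole numbers; this reading is an inference from the printed figures rather than a stated definition, and the blank ISOs cell of the S row is assumed to be zero.

```python
# (ISOs covered, total ISO codes in the family), copied from Figure 6.
FAMILY_ROWS = {
    "AA": (30, 170),
    "AN": (335, 1257),
    "HM": (19, 38),
    "KD": (23, 95),
    "S": (0, 14),   # assumption: the blank ISOs cell means no codes covered
    "ST": (108, 460),
}


def coverage_percent(covered: int, total: int) -> int:
    """Whole-number percentage, truncated rather than rounded."""
    return (100 * covered) // total


def average_per_set(forms: int, sets: int) -> int:
    """Truncated average number of forms per elicitation set."""
    return forms // sets


covered_all = sum(covered for covered, _ in FAMILY_ROWS.values())
total_all = sum(total for _, total in FAMILY_ROWS.values())
overall = coverage_percent(covered_all, total_all)
```

For instance, `average_per_set(873388, 1124)` reproduces the 777 shown in the source summary's total row.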

An  overview  of  the  delivery  hierarchy  is  given  in  Figure  7.  The  project’s  data  delivery  formats 
evolved  rapidly  in  order  to  better  expose  the  content  of  the  data  sets.  Extracting  data  was  not  the  issue; 
rather,  it  was  helpful  to  clarify  the  different  views  and  data  subsets  that  could  be  extracted. 
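One such view is a single table concatenated from all per-lect files of the same name, which the Figure 7 caption illustrates with a shell one-liner. A rough Python equivalent is sketched below; the function name and its arguments are hypothetical, and it assumes the files are UTF-8 text.

```python
from pathlib import Path


def concatenate(root: Path, filename: str, out_path: Path) -> int:
    """Write the contents of every file called `filename` under `root`
    into `out_path`, in sorted path order; return the number of files read."""
    paths = sorted(
        p for p in root.rglob(filename)
        if p.is_file() and p.resolve() != out_path.resolve()
    )
    with out_path.open("w", encoding="utf-8") as out:
        for path in paths:
            text = path.read_text(encoding="utf-8")
            # Guarantee a line break between consecutive files.
            out.write(text if text.endswith("\n") else text + "\n")
    return len(paths)
```

A call such as `concatenate(Path("crcl"), "3_segment.tsv", Path("3_segment.tsv"))` would collect the trigram tables of every lect into one file.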

MetaGloss  and  MetaForm 

There  were  few  surprises  in  regard  to  the  planned  work  of  the  project.  We  set  an  extremely  challenging 
schedule,  on  average  processing  one  ISO  code  per  day,  often  with  two  or  more  lects  per  code. 
Normalizing  to  the  MetaGloss  and  MetaForm  frameworks  required  a  massive  amount  of  effort  simply 
because  even  with  experience  and  computational  assistance,  delivering  >  850,000  items  put  us  at  the 
wrong  end  of  the  lever.  Even  very  low  problem  rates  produced  many,  many  thousands  of  items 
requiring  individual  attention  (and  sometimes  revealing  errors  in  the  original  data  source). 

The  difficulty  of  defining  a  “final”  MetaGloss  standard  came  as  something  of  a  surprise.  While  it  is 
possible  to  restrict  the  content  of  elicitation  sets  (such  as  Swadesh,  various  regional  SIL  survey  sets,  the 
crcl/ - root directory
    ./formats - description of all document formats
    ./paths - grep-able list of paths to all files
    ./tokens.xml, ./tokens.tsv - all lexical data
    ./sketch-rows.tsv, ./sketch-cols.tsv, ./sketch-features.tsv - all segment/canon/feature overviews
    ./readme.panlex - notes on and aggregated manifests for PanLex data
    bib/ - bibliographic metadata
        ./metadata.xml
    geo/ - geographically oriented data
        ./info.geo - list of family, ISO 639-3, country, and ADM-1 region (if available)
        CN/ - one directory per country, ISO 3166-1 alpha-2 codes
        KH/ - ... (about 25 countries in all)
            ./info.geo - country summary (ADM-1 regions are not always available)
            ./Champasak.geo - one file per ADM-1 region. These may later be changed to ISO 3166-2 codes.
            ./Preah_Vihear.geo ... etc.
    metagloss/ - global data for MetaGloss (WordNet 3.0 glosses)
        ./metagloss.txt - all forms and counts in use
        ./new.txt - list of new (sense 0) items
        ./kin.txt - explanation of the components of kin terms
        ./a.txt ... x.txt - lists, by part of speech, for all items
    cognates/
        ./cognates.xml - single file of all items with tagged etygloss and etyset
        ./setByRow.tsv - training data table of all cognate relations (columns are lects)
        ./setByCol.tsv - training data table of all cognate relations (rows are lects)
        etygloss/
            able#a#1/ - one directory per concept/label. 200+ sets per family Y1 to 500 Y4
            above#r#2/ ... Not all sets overlap, and we substantially overshoot the targets.
                ./etyset-1.xml - one file per etymologically related set; typically several
                ./etyset-n.xml ... sets per concept, per family; e.g. AA:S638.xml, HM:R837.xml
    hadr/ - extended HA/DR-specific lexicon, across all languages
        ./readme.txt - discussion of HA/DR item acquisition and form
        crcl/, panlex/ - one directory for each major source
            ./readme.txt - source-specific notes
            ./hadr.tsv - comparable lexicon
    AA/ - one directory each for Austroasiatic, Austronesian, Hmong-Mien, Kra-Dai, Sino-Tibetan
    AN/, HM/, KD/, ST/ ...
        alk/ - one directory for each 3-letter ISO 639-3 code; expect 250+ Y1 to 800-1,000++ Y4
        brb/ ...
            arnaud1997lexique.c1/ - one directory for each documented lect, where directories
            arnaud1997lexique.c2/ ... are named as bibref.column. 500 doculects Y1 to 2000 doculects Y4
                ./metadata.xml - metadata for this lect
                ./lexicon.xml - main lexicon file
                ./sketch-cols.tsv - sketch of segments, column view (easier to read)
                ./sketch-rows.tsv - sketch of segments, row view (easier to grep)
                ./features.tsv - sketch of segments by their features
                ./2_segment.tsv, ./3_segment.tsv - phonological segment bi- and trigrams
                ./2_segment_nuc.tsv, ./3_segment_nuc.tsv - phonological segments, single nucleus
                ./2_token.tsv, ./3_token.tsv - sub-syllable token bi- and trigrams (onset, nucleus, coda, tone)
                ./3_place_back.tsv - place/back feature trigrams
                ./3_manner_height.tsv - manner/height feature trigrams
                ./2_onset_tone.tsv, ./2_nucleus_tone.tsv - onset / nucleus plus tone collocations
                ./cover.tsv - minimum cover set
                ./contrast.tsv - minimal contrast set
                ./density.tsv - clustering coefficient, links, degree, neighbors for each word
                ./load.tsv - functional load, by segment
            info/ - other language data relevant to the ISO 639-3 code
                ./metadata.xml - metadata from Ethnologue, Glottolog.
                ASJP/ - one directory for each wide-coverage source
                Ethnologue/ ... this anticipates we may rely on or develop other sources
                Glottolog/ ... a typical example:
                    ./geo_distance.tsv - geographical distance sets (0 to 500 km, by 100 km)
                    ./ety_distance.tsv - genetic distance sets (n nearest neighbors)
                    ./geo_lexicon.tsv - lexicon of all neighbors within 250 km; known cognates marked
                    ./ety_lexicon.tsv - lexicon of all of this ISO code's sisters
                Panlex/
                    ./manifest.tsv - summary listing of count, source, quality, license for all lect data
                    ./iso-var.tsv - PanLex designation of the lect, e.g. tha-001.tsv
            text/ - orthographic data if available
                Scannell/ - at present, only files from the An Crúbadán project are supplied.
                    BCP-47/ - the sample's BCP-47 code
                        ./info.txt - lect and source data identification
                        ./urls.txt - sources for the ngrams and wordlist (texts are not included)
                        ./chartrigrams.txt, ./wordbigrams.txt, ./words.txt - datasets


Figure 7 Structure of the distribution. When appropriate, files have a comment that recapitulates source information, so that
full sets can be concatenated from the root, e.g.: cat `find ./crcl | grep 3_segment.tsv` > 3_segment.tsv


IDS  /  LWT  family,  and  the  ILCAA  /  Princeton  family),  we  faced  the  opposite  problem  of  having  to 
accommodate  a  wide  variety  of  formal  and  informal  gloss  lists.  We  see  MetaGloss  remaining  as  a 
restricted  but  extensible  framework  rather  than  a  completely  controlled  standard. 

The  MetaForm  feature  analysis,  in  contrast,  converged  fairly  quickly  on  the  set  now  in  use. 
Nevertheless,  we  had  to  retain  some  notational  features  (the  “Chinese”  IPA  characters)  whose 
importance  might  not  have  been  obvious  had  we  begun  work  in  a  different  region.  Thus,  we  anticipate 
that,  say,  the  African  languages  will  call  for  both  predictable  and  perhaps  unpredictable  extensions. 

Process  management 

Finally,  we  noticed  an  interesting  degree  of  culture  clash  between  computational  and  comparative 
linguists,  both  within  our  team,  and  the  LORELEI  project  at  large. 


computational linguists                        (mostly comparative) linguists

big data - need for large samples              small data - need for high accuracy
noise that could be ignored                    mistakes that needed to be fixed
orthography, reliance on source as-is          phonology, need to modify the given forms
data-driven methods                            analytical methods
anonymous discovery / acquisition of data      personal relationships with linguists
difficulty recognizing GIGO situations         desire to build Swiss watches
acceptance of continuous revision              focus on final publication
if it’s measurable, it’s progress              question if small improvements will scale up
iterative process - rebuild the data system    linear process - assemble final components
linguists should enable better software        software should enable better linguists

Table 7 Typical gaps in perception between computational and comparative linguists.


Our  work  -  methodical  selection  and  normalization  of  representative  data  sets  -  is  typically  the 
domain  of  comparative  linguistics  and  proto-language  reconstruction;  traditionally  an  area  of  boutique  / 
handicraft  linguistics.  We  were  interested  in  finding  ways  to  industrialize  this;  not  simply  by  building 
faster  software  whose  output  would  require  less  correction,  but  by  providing  faster,  more  accurate  data 
management  by  the  linguists  -  less  “linguists  enable  software,”  and  more  “software  enables  linguists.” 

For example, choices made in normalizing notation affected automated syllabification; while tweaks
of language and subbranch-specific syllable-break rubrics affected proper recognition of sub-syllabic
segments - which sometimes required going back to the beginning and altering notation. Similarly,
source glosses were sometimes ambiguous in ways that could only be resolved at the end of the process,
when items were being clustered into cognate sets; again, initial source data (glossing) was somewhat
indeterminate until the end of the process.

Thus,  instead  of  focusing  on  standalone  software  systems  that  would  incorporate  linguistic 
knowledge  per  se  (the  “linguists  enable  software”  approach),  we  also  wrote  tools  that  provided  myriad 
data  views  to  expose  different  kinds  of  inconsistency,  and  let  the  linguist  manage  the  development  cycle 
very,  very  quickly;  e.g.  by  immediately  seeing  the  ultimate  effects  of  early  choices  in  data  preparation, 
and  by  fixing  the  software  process,  rather  than  fiddling  with  the  end  of  the  data  pipeline.  Providing 
rapid  feedback  loops  on  the  data  life  cycle,  and  constant  willingness  to  redesign  tools  as  needed,  made 
the  difference. 
5. Conclusions

This  document  summarizes  work  carried  out  by  CRCL  on  behalf  of  the  DARPA  LORELEI  project.  We 
have  described  both  the  specific  contract  deliverables  and  our  additional  activities.  All  required 
milestones  were  surpassed,  and  all  data  and  analysis  is  available  for  re-use. 

While  the  project  was  limited  to  providing  data  for  a  single  region,  we  have  shown  that  it  is  possible 
to  develop  large-scale,  fine-grained,  comparable  lexical  and  phonological  data  sets  quickly,  and  at  a 
reasonable  cost.  In  addition,  we  have  demonstrated  that  such  data  has  downstream  applications  in 
supporting DARPA’s mission. We feel that an ongoing project of this type for Asia-Pacific and other
regions  is  both  feasible  and  desirable. 

Our  present  language  technology  situation  hardly  seems  tenable:  for  the  majority  of  world 
languages,  we  have  little  data  beyond  ISO  639-3  identifiers,  brief  prose  descriptions,  and  rough  speaker 
areas  (unfortunately,  not  defined  in  terms  of  standard  ADM  area  boundaries).  Specific  language  data 
that  would  be  useful  in  computational  applications  -  dictionaries,  grammars,  phonotactic  analyses, 
corpora  -  is  only  narrowly  available. 

Experience  shows  that  neither  the  marketplace  nor  traditional  scientific  funding  agencies  are  likely  to 
fill  this  gap.  From  the  commercial  point  of  view,  small  languages  do  not  justify  investment  costs;  their 
speakers  are  either  too  few  in  number,  or  too  poor,  even  when  they  number  in  the  millions.  From  the 
research  point  of  view  (e.g.  the  NSF-NEH  Documenting  Endangered  Languages  initiative),  funding 
tends  to  support  documentation  of  single  languages,  and  the  opportunity  this  provides  for  training  young 
linguists.  When  broader  linguistic  surveys  are  done,  they  usually  focus  on  data  of  phylogenetic  interest 
for  proto-language  reconstructions  that  involve  single  subgroups  or  families  -  not  on  on-the-ground 
reality  that  is  needed  for  computationally  useful  modeling. 

To  paraphrase  Chamfort,1  we  may  begin  by  choosing  the  most  inviting  languages,  but  in  the  end  we 
want  them  all.  LORELEI  is  one  of  a  continuing  series  of  exercises  in  developing  language  technology. 
Methods  and  goals  have  changed  in  the  decades  since  TIPSTER,  but  the  list  of  languages  of  interest 
always  gets  longer. 


1  “Most  compilers  of  anthologies  of  poetry  or  epigrams  are  like  people  eating  cherries  or  oysters:  they  start  by  picking  out  the 
best,  and  end  up  eating  the  lot.”  Nicolas-Sebastien  Chamfort,  Reflections  on  Life,  Love  and  Society  (1795). 




6.  Recommendations 

We  conclude  with  recommendations  for  ongoing  work  (beyond  extending  language  coverage). 

language identification: language identification based on trained trigram models or similar is extremely
effective; see [Scannell 2007]. However, we may not have substantial, identified text samples to work
with; e.g. when the use of informal orthographies for online / text message communication is
widespread, as is increasingly the case for non-roman scripts, as well as languages without formal
writing systems. It would be useful to see if a phonodynamic language model, based partly on
recognizable segments, and partly on the relations and co-occurrence restrictions between arbitrary
segments and on their frequency, salience, and functional load, is sufficient to identify a language
that relies on an unknown orthography.
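For reference, the trigram-based identification mentioned at the start of this recommendation can be sketched as a small character-trigram classifier with add-one smoothing. This is a generic textbook formulation, not the method of [Scannell 2007] or of any LORELEI performer; the function names and the tiny training samples are hypothetical.

```python
import math
from collections import Counter


def char_trigrams(text: str) -> list[str]:
    """Overlapping character trigrams, padded so word edges are represented."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(samples: dict[str, str]) -> dict[str, Counter]:
    """One trigram count table per language label."""
    return {lang: Counter(char_trigrams(text)) for lang, text in samples.items()}


def identify(text: str, models: dict[str, Counter]) -> str:
    """Return the label whose model gives the text the highest smoothed
    log-likelihood."""
    grams = char_trigrams(text)
    vocab_size = len(set().union(*models.values()) | set(grams))

    def score(counts: Counter) -> float:
        total = sum(counts.values())
        return sum(math.log((counts[g] + 1) / (total + vocab_size)) for g in grams)

    return max(models, key=lambda lang: score(models[lang]))


# Tiny illustrative training texts; real models need far larger samples.
models = train({
    "eng": "the quick brown fox jumps over the lazy dog the cat and the hat",
    "ind": "anjing itu makan ikan dan kucing itu minum air di rumah",
})
```

The same scoring scheme could in principle be applied to phonological segments instead of characters, which is the direction the phonodynamic proposal above points toward.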

Web corpus acquisition: building text corpora by Web crawling and scraping is a well-established
discipline.  However,  it  does  not  address  the  problem  of  crawling  and  language  identification  absent  a 
set  of  seed  search  terms.  Nor  may  these  be  trivially  obtained  if  and  when  a  language  either  has  no 
formal  writing  system,  or  is  so  obscure  that,  say,  its  Wikipedia  page  does  not  point  to  native-language 
sources.  We  propose  that  informal  low-density  language  texts  are  likely  to  be  written  using  the  roman 
alphabet,  and  that  we  can  make  reasonable  guesses  as  to  how  our  phonologically  transcribed  data  might 
be  transliterated  by  native-language  speakers,  providing  the  necessary  seed  search  terms. 
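The seed-term idea can be sketched as a candidate generator that expands each transcribed segment into plausible roman spellings. The mapping table below is purely illustrative (it is not drawn from any CRCL data or orthographic survey), and the function name and limit parameter are hypothetical.

```python
from itertools import product

# Assumed, illustrative guesses at how speakers might spell each segment.
ROMAN_GUESSES = {
    "ŋ": ["ng"],
    "ɲ": ["ny", "nh"],
    "ʔ": ["", "'", "q"],
    "ə": ["e", "a"],
    "ʃ": ["sh", "s"],
}


def romanizations(segments: list[str], limit: int = 50) -> list[str]:
    """Return up to `limit` distinct spellings for a segmented form.
    Segments without an entry in the table are kept unchanged."""
    options = [ROMAN_GUESSES.get(segment, [segment]) for segment in segments]
    spellings: list[str] = []
    for combo in product(*options):
        word = "".join(combo)
        if word not in spellings:
            spellings.append(word)
        if len(spellings) >= limit:
            break
    return spellings


seed_terms = romanizations(["ŋ", "a", "ʔ"])
```

Each generated spelling would then be tried as a search term, and pages on which many candidates for different glosses co-occur become crawl seeds.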

ISO 639-3 audit: this standard was adopted in 2007, based on the then-current edition of Ethnologue.
It is managed as a completely separate entity, and relies on outside requests for additions, deletions, and
other changes. ISO 639-3 does not document languages per se; it points to outside authorities (at this
point, only Ethnologue) for assistance in language denotation, i.e. any descriptive information about the
language, or its place among related languages. Ethnologue, in turn, does not regularly document the
sources of its conclusions (and has recently gone to a fee-for-access model for these).

The problematic bottom line is that there is no clear measure of the distinction between assigned ISO
codes (languages that are essentially the same may have different codes), or of the tolerable degrees of
divergence within a single assigned ISO code (so-called dialects may be mutually unintelligible).
Government decisions that rely on ISO codes as a measure of linguistic diversity may not be
well-founded. CRCL wordlists - in some cases, representing many lects within a single “language” - can
show the degree of lexical diversity (or lack thereof) between lects and languages, and lay the
foundation for more reliable measures of linguistic divergence.

lexical item generation: approaches to this problem include straightforward machine translation
(phonological  segments  are  treated  as  words  in  a  sentence),  extended  MT  approaches  (e.g.  adding 
feature  bundle  information),  or  translation  by  phonological  transliteration/transduction.  Linguistically 
motivated  approaches  include  attempting  to  generate  a  parent  proto-form  first  (then  using  that  as  the 
translation/transliteration  source),  and  working  from  an  existing  proto-language  model. 

identification or prediction of nativized loanwords: while similar to the problem above, this requires a
separate  analysis  that  attempts  to  model  the  phonological  reduction  or  feature  insertion  typically  found 
in  loanword  acquisition  (as  opposed  to  the  regular,  lexicon-wide  patterns  of  phonological  variation 
found  in  divergent  languages). 

ortho-to-phono: CRCL wordlists provide the necessary data for alignment with dictionary headwords,
based on a combination of (raw and normalized) gloss/definition and unambiguous IPA/orthographic
correspondences. This should be sufficient for training general-purpose orthography-to-phonology tools.

machine-assisted transcription / segmentation: automated transcription can be highly effective when
trained language models exist. However, experiments on adapting available models to low-resource
languages have not been promising. The CRCL wordlists supply the necessary data for an attempt to
bootstrap assistive software for limited cases - e.g. recorded wordlists, which we can help locate and
provide. Similarly, the phonodynamic models we provide may give some traction to simple tasks on
open audio; e.g. locating word boundaries.

minimizing resource acquisition effort: we do not know how well the distribution of tokens and
segments  within  a  lexicon  models  typical  corpus  use.  Nor  do  we  know  how  large  a  subset  of  the  lexicon 
is  required  to  model  the  “full”  (say,  10,000  words)  lexicon,  or  how  to  estimate  whether  or  not  a  sample 
in  hand  is  sufficient.  We  anticipate  that  a  combination  of  Monte  Carlo  testing,  and  application  of  Zipf’s 
and  Heaps’  Laws,  would  address  the  question  of  devising  stopping  rules  for  minimally  useful  lexicon 
acquisition.  This  is  a  rather  important  question,  both  from  the  point  of  view  of  extending  any  of  our 
shorter  resources,  and  of  proposing  any  new  efforts  for  data  acquisition  (either  in  the  field,  or  from 
untranscribed  legacy  field  data). 
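The Monte Carlo part of this proposal can be sketched as repeated random subsampling of a lexicon, measuring how much of the full segment inventory each subsample recovers. This is an illustration only: inventory coverage is just one possible sufficiency criterion, no Zipf or Heaps fit is attempted, and all names, parameters, and the toy lexicon are hypothetical.

```python
import random


def inventory_coverage(lexicon: list[list[str]], sample_size: int,
                       trials: int = 200, seed: int = 0) -> float:
    """Mean share of the full segment inventory seen in random samples
    of `sample_size` words."""
    rng = random.Random(seed)
    full = {segment for word in lexicon for segment in word}
    total = 0.0
    for _ in range(trials):
        sample = rng.sample(lexicon, sample_size)
        seen = {segment for word in sample for segment in word}
        total += len(seen) / len(full)
    return total / trials


def smallest_sufficient_sample(lexicon: list[list[str]], target: float = 0.95,
                               step: int = 1) -> int:
    """Smallest tested sample size whose mean coverage reaches the target;
    falls back to the whole lexicon."""
    for size in range(step, len(lexicon) + 1, step):
        if inventory_coverage(lexicon, size) >= target:
            return size
    return len(lexicon)


# Toy lexicon of segmented words.
toy = [["m", "a", "t", "a"], ["k", "u", "l", "i", "t"], ["a", "p", "i"]]
```

On a realistic lexicon the coverage curve flattens well before the full list is sampled, and the point at which it crosses the target is a candidate stopping rule.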

evidence-based evaluation of Ethnologue / Glottolog subgrouping: in comparative / historical
linguistic  theory,  subgroups  are  based  on  objective  shared  phonological  and  lexical  innovations. 
However,  there  is  considerable  difference  between  the  Ethnologue  and  Glottolog  analyses,  and  neither 
points  to  any  clear  analysis  of  lexical  evidence.  The  CRCL  wordlists  begin  to  provide  the  data  required 
to  generate  an  independent  subgroup  analysis  of  languages  in  Asia-Pacific  (based  on  distance  measures), 
and  to  prompt  the  development  of  tools  intended  to  specifically  identify  turning-point  innovations.  Both 
of  these  support  LORELEI  efforts  in  lexicon  extension,  language  identification,  and  other  language 
modeling  applications. 

linguistic data warehouse / workbench apps: looking beyond the front-line performers to LORELEI
tool  integration,  CRCL’s  fine-grained  coverage  of  the  Asia-Pacific  region  supports  applications  of 
interest  to  both  linguists  and  early  responders.  These  include  the  ability  to  project  linguistic  resources 
onto  local  maps,  and  to  single  out  shibboleths  -  locally  salient  phonology  or  word  forms  -  that  help 
identify  speakers. 
Appendix A MetaGloss

MetaGloss  guides  the  normalization  of  glosses.  The  notes  below  are  repeated  from  section  8.  above. 

• when possible, a single WordNet 3.0 sense is provided: house#n#1

• when two or more useful interpretations are plausible, they are pipe-separated: bake#v#1|toast#v#1.

• several word classes have been added (with all items numbered #1): d(emonstrative), j (conjunction),
k(in term), m(odal), p(ronoun), q (interrogative), x (temporarily uncategorized).

• when new senses are added, they are numbered #0: armspan#n#0.

• a polysemous sense that does not exist in English is indicated by labeling the WN 3.0 sense: v@fist#n#1
indicates the verb sense of the noun “fist,” i.e. “make a fist.”

• kin terms are built up in regular fashion, starting with the person who is ultimately referenced:
mot.fat#k#1 is the mother of the father, or the paternal grandmother.

• senses may have attributes that help document what we believe is the useful reference meaning; e.g.
carry#v#1:tumpline. This indicates that for purposes of cognate grouping the item clusters with “carry”
terms, but keeps “tumpline” accessible. These head+attribute forms may be simplified in the future.

• classifiers are noted by the :clf attribute, e.g. basket#n#1:clf is a classifier for baskets, several#a#1:clf
for several items, kick#v#1:clf is an instance of kicking. There may be some inconsistency in the listing
of feature-oriented classifiers (e.g. long, thin items) because it is not always clear if the given form is a
classifier, or just an instance of an item.

• a small amount of ad-hoc notation may be encountered, e.g. “!” in !understand#v#1 negates the primary
term. These affect only a few items for which proper handling is unclear.

It  is  important  to  remember  that  MetaGlosses  do  not  replace  the  raw  glosses.  Rather,  they  provide  an  additional 
layer  that  is  more  usable  as  an  index  to  phonological  forms  in  many  languages  -  an  index  that  points  to  the  forms 
that  are  most  likely  to  be  genetically  related,  but  still  respects  semantic  variation  between  lects. 
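The label conventions listed above are regular enough to parse mechanically. The sketch below is one possible reading of them, not a CRCL tool: it assumes labels of the form head#class#sense with an optional leading “!”, an optional class prefix such as v@, an optional :attribute suffix, and “|” between alternatives. The class and field names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

_LABEL = re.compile(
    r"^(?P<neg>!)?"                       # ad-hoc negation marker
    r"(?:(?P<as_pos>[a-z])@)?"            # sense shifted to another word class
    r"(?P<head>[^#|:@!]+)#(?P<pos>[a-z])#(?P<sense>\d+)"
    r"(?::(?P<attr>.+))?$"                # attribute, e.g. clf or tumpline
)


@dataclass(frozen=True)
class MetaGloss:
    head: str
    pos: str
    sense: int                  # 0 marks a sense added beyond WordNet 3.0
    as_pos: Optional[str] = None
    attribute: Optional[str] = None
    negated: bool = False


def parse_metagloss(label: str) -> list[MetaGloss]:
    """Parse one label into its pipe-separated alternative senses."""
    senses = []
    for part in label.split("|"):
        match = _LABEL.match(part.strip())
        if match is None:
            raise ValueError(f"not a MetaGloss label: {part!r}")
        senses.append(MetaGloss(
            head=match["head"],
            pos=match["pos"],
            sense=int(match["sense"]),
            as_pos=match["as_pos"],
            attribute=match["attr"],
            negated=match["neg"] is not None,
        ))
    return senses
```

For example, `parse_metagloss("carry#v#1:tumpline")` yields a single sense whose head is `carry` and whose attribute is `tumpline`, so the item can be grouped with other “carry” terms while the attribute stays accessible.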

Appendix  B  MetaForm 

MetaForm  guides  the  normalization  of  raw  phonological  transcription.  Basic  guidelines  are  simple: 

• Standard IPA is always used with the exception of these characters: / ɿ ʅ ʮ ʯ /, which may be found in
the phonological features table.

•  Source  notation  that  appears  to  indicate  minor  phonetic  variation,  and  may  hinder  useful  lect  comparison, 
is  suppressed. 

•  Syllable  boundaries  are  always  marked. 

•  Raised  characters  are  either  diacritics  (e.g.  indicating  aspiration)  or  secondary  features  according  to  our 
analysis  of  the  syllable. 

•  A  fully  tokenized  form  relies  on  three  separator  characters;  note  that  tie  characters  are  not  used: 

o ¦ (x00A6 / &#166;) separates the onset, nucleus, coda, and tone sections

o  :  separates  the  core  and  post-core  sections  of  the  onset  and  coda.  A  pre-core  is  possible,  but  not 
currently  used. 

o  .  separates  bound  features  from  the  pre-core  and  post-core,  and  vowels  within  the  nucleus. 

o | separates syllables.

•  A  segmented  form  uses  .  to  separate  phonological  segments. 

•  Some  ambiguity  and  inconsistency  are  tolerated;  particularly  in  handling  of  minor  syllables. 

Like  MetaGloss,  MetaForm  cannot  entirely  replace  the  raw  transcribed  forms.  Again,  they  help  to  provide  an 
additional  layer  that  serves  as  the  most  probable  common  index  of  features  shared  within  and  between  languages. 
Appendix C Phonological features


height 

backness 

place 

manner 

high 

i 1  y  i  *  a  ui  u  \ 

front 

i'yepECEeoeqqyq 

bilabial 

pb6Pp$$bpmB 

nasal 

m  ip  n  p.i>ji  p  n 

near-high 

I  Y  u 

near-front 

I  Y 

labiodental 

pf  bv  f  ip  v  v 

stop 

pbtdtcj.Jd’CjkgqG?? 

close-mid 

e090TO 

central 

i‘n9ea9QEa3 

dental 

0  a  te  dd 

implosive 

mid 

3  9  E 

near-back 

u 

alveolar 

tdcftscksznrjrli^tid^l 

affricate 

pi])  bp  pf  bv  t0  dS  ts  tf  ck  tp 
cfe  tl  dfe  t§  dz,cq  jj  kx  gy  qx  gk 

open-mid 

8  3  oe  G  A  D 

back 

UIUYOADQDU 

retroflex 

t  z;.tg  dz^iLt  -ll 

fricative 

$Pfv0asz§zj3??q,ix 

YXKh¥hfiib 

near -low 

c6  e 

palatoalveolar 

approximant 

wjqj  iqql[f  XI 

low 

a  a  d 

alveolopalatal 

tap-flap 

v  r  t  -1 

palatal 

cjfqcqjjijijijX 

trill 

B  r  R  H  ¥ 

labiopalatal 

q 

velar 

kgtfkxgyxypiql 

labiovelar 

W 

uvular 

qGXnqx®NRtf 

pharyngeal 

?  h  9  h  ¥ 

glottal 

?hfi 

class 

role 

position 

length 

pre-articulation 

consonant 

onset 

pj~Q 

epenthetic 

9  i  i 

prenasalized 

m  n  r\ji  r)  n 

vowel 

nucleus 

core 

short 

X 

devoiced 

XX 

syllabic 

X 

coda 

post 

long 

: 

preglottalized 

? 

minor 

1.1  1.2  2.2  1.3  2.3  3.3  1.4  2.4  3.4  4.4 

preaspirated 

h 

prelabialized 

w 

prestopped 

b  d  J  0 

realization 

phonation 

post-articulation 

rounded 

HHuy0oeneGa3OUY 

nasal 

X 

nasalized 

m  n  i\ji  r)  n 

voiced 

m  it)  n  r|.i>  ji  n  b  d  ^  j  g  g  ?  bp  bv  dS  d;  ck 
dfe  dz^j  gyPvSzzL3jyK9wr.if.|Jiim 
lt5AllI-lBRH?GK?cLfi 

aspirated 

h 

glottalized 

? 

retroflexed 

I'M. 

devoiced 

XX 

palatalized 

j 

lateralized 

Hferidfe  JIAH 

breathy 

labialized 

w 

fricated 

1M1M.U 

creaky 

X 

labiopalatalized 

nonvocalized 

dental 

X 

stopped 

b  d  j  0 

prevocalized 

raised 

X 

velarized 

~XY 

vocalized 

lowered 

X 

pharyngealized 

X1 

rhotic 

X" 
Appendix D File formats

Files  discussed  below  exemplify  the  full  distribution. 

When  appropriate,  the  /Ethnologue  path  and  files  are 
paralleled  by  a  /Glottolog  set  (and  may  be  expanded  to 
other  analyses).  Below,  the  #File:  line  (giving  the  path) 
is  not  part  of  the  file.  Commented  lines  in  bold  text  are 
column  labels. 

#File: crcl/paths.txt
#File: crcl/geo/info.geo
#File: crcl/geo/CN/info.geo
#File: crcl/geo/CN/Yunnan.geo
#File: crcl/metagloss/metagloss.txt
#File: crcl/metagloss/new.txt
#File: crcl/metagloss/kin.txt
#File: crcl/metagloss/n.txt
#File: crcl/cognates/setByRow.tsv
#File: crcl/cognates/setByCol.tsv
#File: crcl/AA/alk/huffman1971vocabulary.c12/2_segment.tsv
#File: crcl/AA/alk/huffman1971vocabulary.c12/3_segment_nuc.tsv
#File: crcl/AA/alk/huffman1971vocabulary.c12/cover.tsv
#File: crcl/AA/alk/huffman1971vocabulary.c12/contrast.tsv
#File: crcl/AA/alk/huffman1971vocabulary.c12/density.tsv
#File: crcl/AA/alk/huffman1971vocabulary.c12/load.tsv
#File: crcl/AA/alk/info/Ethnologue/geo_distance.tsv
#File: crcl/AA/alk/info/Ethnologue/ety_distance.tsv
#File: crcl/AA/alk/info/Ethnologue/ety_lexicon.tsv
#File: crcl/AA/alk/info/Ethnologue/geo_lexicon.tsv
#File: crcl/AN/mak/text/Scannell/mak-Latn/info.txt
#File: crcl/AN/mak/text/Scannell/mak-Latn/urls.txt
#File: crcl/AN/mak/text/Scannell/mak-Latn/chartrigrams.txt
#File: crcl/AN/mak/text/Scannell/mak-Latn/wordbigrams.txt
#File: crcl/AN/mak/text/Scannell/mak-Latn/words.txt
#File: crcl/cognates/cognates.xml
#File: crcl/cognates/etygloss/able#a#1/AA:S1179.xml

#File: crcl/paths.txt

#path
crcl
crcl/paths.txt
crcl/hadr
crcl/metagloss

Paths to all files.

#File: crcl/geo/info.geo

#bibref               column  ISO  country   ADM-1                   lat,long
tryon1995comparative  80      rap  Chile                             -27.1248,-109.3571
hudak2008comparative  1       tha  Thailand  Changwat Lop Buri       14.7368,100.5249
hudak2008comparative  5       tts  Thailand  Changwat Maha Sarakham  16.1155,102.9990
hudak2008comparative  8       nod  Thailand  Changwat Lampang        18.3471,99.7262

The top-level list only provides Ethnologue data because
it has slightly better ISO 639-3 coverage. Latitude and
longitude are typically reals with four decimal places, and
reflect the location of the populated place nearest to the
lat,long figure that we license from SIL, which cannot
itself be released.

#File: crcl/geo/CN/info.geo

#bibref       column  ISO  country  ADM-1          lat,long
huang1992tbl  10      pmi  China    Sichuan Sheng  27.9014,101.5165
huang1992tbl  11      jya  China    Sichuan Sheng  31.7580,102.2552
huang1992tbl  12      ero  China    Sichuan Sheng  30.8187,101.8259
huang1992tbl  13      qvy  China    Sichuan Sheng  30.3193,100.8392

These have the same format as the top-level
crcl/geo/info.geo file. The country code is the two-letter
ISO 3166-1 alpha-2 abbreviation. The summary
info.geo file is provided because ADM-1 codes cannot
always be identified for a given lat,long value (e.g. if it
happens to fall in open water). We expect to resolve
these over time.


#File: crcl/geo/CN/Yunnan.geo

#bibref       column  ISO  country  ADM-1   lat,long
huang1992tbl  20      duu  China    Yunnan  27.9801,98.4442
huang1992tbl  28      acn  China    Yunnan  24.6798,98.7253
huang1992tbl  29      acn  China    Yunnan  24.6798,98.7253
huang1992tbl  30      atb  China    Yunnan  24.4029,98.3244

These have the same format as the top-level
crcl/geo/info.geo file, and describe the current ADM-1.


#File: crcl/metagloss/metagloss.txt

#metagloss  count
a           592
d           11
j           7
k           159

The metagloss.txt file summarizes the POS-specific
files; however, the POS-specific files split all a\b forms
into the individual words (which may have different POS).


#File: crcl/metagloss/new.txt

#metagloss          count  explanation
a_little#n#0        38
among#r#0           30
armspan#n#0         121
armspan#n#0:around  41



#File: crcl/metagloss/kin.txt

#All kin term components in use
.BY.   address term: a.BY.b
Post-modifiers
:addr  general address term

This file documents the construction of kin terms in
MetaGloss.




#File: crcl/metagloss/n.txt

#POS              count
1#n#1             37
Adam's_apple#n#2  35
Allium#n#1        5
April#n#1:lunar   36

This particular file lists all noun forms that appear in
MetaGloss. Other x.txt POS files are similar:
a:adjective, d:demonstrative, j:conjunction, k:kin,
m:modal, n:noun, p:pronoun, q:interrogative, r:adverb,
v:verb, x:unassigned.


#File: crcl/cognates/setByRow.tsv

#count  EtySet      cogset   arnaud1997lexique.c1  arnaud1997lexique.c2  arnaud1997lexique.c3  ...
7       Allium#n#1  HM:R599
9       Allium#n#1  HM:R835
15      Hmong#n#1   HM:R73
17      I#p#1       AA:2

Each EtySet is the rough gloss of a historical form,
while each cogset includes related terms from modern
languages, given in the appropriate cell (most cells are
empty). We expect there to be at least one cogset per
family. Cogsets are named either by a reference to the
literature, or by an arbitrary number associated with the
family. Over time, both cogsets and etysets will cluster
into larger groupings of genetically related forms.


#File: crcl/cognates/setByCol.tsv

#count  source                ISO  Allium#n#1|HM:R599  Allium#n#1|HM:R835  Hmong#n#1|HM:R73
220     arnaud1997lexique 10  npy
391     arnaud1997lexique 11  sda
369     arnaud1997lexique 12  mqj
262     arnaud1997lexique 14  rog

The setByCol view labels each column with an
EtySet|cogset pair. The count gives the number of
items from a particular source that have been assigned to
cogsets. These items appear in the table cells (most are
empty). Over time, cells will contain more forms as
cognate sets are first developed following current
semantics, then joined to account for semantic shift and
borrowing.


#File: crcl/AA/alk/huffman1971vocabulary.c12/2_segment.tsv

#huffman1971vocabulary  12  AA  alk  Guibian Zhuang
<  k  90
h  >  88
g  >  87
<  t  78

Segment bigrams and counts. Pre- and post-word
boundaries are shown with < and >. The first line gives
the table contents: bibref and column, family, ISO 639-3
code, and ISO language name.
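
As an illustration only, the bigram counts described above could be produced along the following lines. This is a minimal sketch, not CRCL's actual pipeline: the function name `segment_bigrams` and the sample word list are hypothetical, and the sketch assumes each word has already been split into phonological segments.

```python
from collections import Counter

def segment_bigrams(words):
    """Count adjacent segment pairs; < and > mark the word boundaries."""
    counts = Counter()
    for segments in words:
        padded = ["<", *segments, ">"]
        # Pair every segment with its right-hand neighbor.
        counts.update(zip(padded, padded[1:]))
    return counts

# Hypothetical pre-segmented forms (not taken from the dataset).
words = [["k", "a", "t"], ["k", "a:", "t"], ["t", "a", "p"]]
bigrams = segment_bigrams(words)

# Print in the same tab-separated layout as 2_segment.tsv.
for (left, right), n in bigrams.most_common():
    print(left, right, n, sep="\t")
```

Treating a long vowel such as `a:` as one list element is what keeps it a single segment in the counts; the same approach extends to the 3_... files by zipping three offsets instead of two.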
#File: crcl/AA/alk/huffman1971vocabulary.c12/3_segment_nuc.tsv

#huffman1971vocabulary  12  AA  alk  Guibian Zhuang
<  k   a  42
<  p   a  31
<  t   a  28
<  ph  a  26

Segment trigrams and counts, formatted as above, except
that the complete nucleus (diphthongs and longer) is
treated as a single segment. Other 3_... files are similar,
with content as per file name.

#File: crcl/AA/alk/huffman1971vocabulary.c12/cover.tsv

#huffman1971vocabulary  12  AA  alk  Guibian Zhuang

#64  letters,  31  words 

#  aa:  bcchdee:  fhii:  j  k  kh  kw  1  m  m  n  n  o  o :  pphrrwstthuu:  w  g  g  °  g_  a  o  o :  e 
e:  e  e:  i  i:  i  ji  ji  1  ?j  ?1  ?r  ?w  mb  mp  mph  ^k  ^kh  ^c  ^ch  nt  nth 
prug  tip  kasok  grave#n#2
thalu:p  thane:j  clothing#n#1

Minimum cover set. Line 1 describes the source and
language. Line 2 gives the number of distinct
phonological segments, and the size of the minimum
cover set. The remainder of the file consists of
(tab-separated) words and their glosses.
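
The report does not say how the cover set is computed. One common approach, shown here purely as an assumption, is the greedy set-cover heuristic: repeatedly pick the word that adds the most segments not yet covered. The result is small but not guaranteed to be the true minimum. The function name `greedy_cover` and the toy lexicon are hypothetical.

```python
def greedy_cover(lexicon):
    """Return words whose segments jointly include every segment in the lexicon.

    lexicon maps each word to its list of phonological segments.
    Greedy heuristic: an approximation, not a proven minimum.
    """
    uncovered = set().union(*lexicon.values())
    chosen = []
    while uncovered:
        # Sorting first makes ties resolve alphabetically (max keeps the first best).
        best = max(sorted(lexicon), key=lambda w: len(uncovered & set(lexicon[w])))
        chosen.append(best)
        uncovered -= set(lexicon[best])
    return chosen

# Hypothetical toy lexicon (not taken from the dataset).
lexicon = {
    "kat": ["k", "a", "t"],
    "ka:t": ["k", "a:", "t"],
    "tap": ["t", "a", "p"],
    "pa:": ["p", "a:"],
}
cover = greedy_cover(lexicon)
print(len(cover), "words:", cover)
```

An exact minimum would require an integer-programming or exhaustive search; for elicitation word lists the greedy result is usually close enough to be useful.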

#File: crcl/AA/alk/huffman1971vocabulary.c12/contrast.tsv

#huffman1971vocabulary  12  AA  alk  Guibian Zhuang
a  a:  kat   ka:t   k#t   from#r#0/only#a#1  burn#i#3
a  a:  paj   pa:j   p#j   three#n#1          rice#n#1:cooked
a  a:  phat  pha:t  ph#t  grass#n#1          chew#v#1
a  a:  tap   ta:p   t#p   stab#v#2           slap#v#1

Minimum contrast set. The columns show the two
contrasting segments, the words each appears in, and a
joint form with # in the common slot. The final columns
have the metaglosses of the two contrasting words.
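
Rows of this shape can be found by grouping words on a "frame" in which one segment slot is blanked out: any two words sharing a frame are a minimal pair. The sketch below is one possible way to do this, not the method used to build the distributed file; `contrast_rows` is a hypothetical name, the forms follow the sample rows for kat/ka:t and tap/ta:p, and the sketch lists all pairs rather than selecting a minimum set.

```python
from collections import defaultdict

def contrast_rows(lexicon):
    """Yield (seg1, seg2, word1, word2, joint, gloss1, gloss2) for minimal pairs.

    lexicon maps word -> (tuple of segments, metagloss).
    """
    frames = defaultdict(list)
    for word, (segments, gloss) in lexicon.items():
        for i, seg in enumerate(segments):
            # The frame is the word with slot i removed.
            frame = (tuple(segments[:i]), tuple(segments[i + 1:]))
            frames[frame].append((seg, word, gloss))
    rows = []
    for (before, after), entries in frames.items():
        entries.sort()
        joint = "".join(before) + "#" + "".join(after)
        for a in range(len(entries)):
            for b in range(a + 1, len(entries)):
                (s1, w1, g1), (s2, w2, g2) = entries[a], entries[b]
                if s1 != s2:
                    rows.append((s1, s2, w1, w2, joint, g1, g2))
    return sorted(rows)

lexicon = {
    "kat": (("k", "a", "t"), "from#r#0/only#a#1"),
    "ka:t": (("k", "a:", "t"), "burn#i#3"),
    "tap": (("t", "a", "p"), "stab#v#2"),
    "ta:p": (("t", "a:", "p"), "slap#v#1"),
}
rows = contrast_rows(lexicon)
for row in rows:
    print("\t".join(row))
```

Keying the frame on segment tuples rather than on a joined string avoids conflating, say, `k` + `h` with the single segment `kh`.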

#File: crcl/AA/alk/huffman1971vocabulary.c12/density.tsv

#huffman1971vocabulary  12
#Clustering coefficient (2Nv/Kv(Kv-1))  Links (Nv)  Degree (Kv)  word  neighbors
0.7778  28  9  ca:  co:|ja:|ka:|ma:|na:|ra:|ta:|tha:|mpha:
0.3611  13  9  maj  mat|maŋ|maʔ|mo:j|paj|saj|ʔaj|ŋkaj|ɲcaj
0.3333  12  9  paj  par|pat|paŋ|pap|pa:j|saj|ʔaj|ŋkaj|ɲcaj

Each neighbor in the list differs from the target word by a
single phonological segment. Kv is the number of these
neighbors. Nv is the number of pairs of neighbors that are
themselves one segment away from each other. The clustering
coefficient is in the range 0 .. 1, and gives a sense of
how tightly bound the neighborhood is.
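
The formula 2Nv/Kv(Kv-1) can be checked against the first sample row. The sketch below assumes that "one segment away" means a single substitution between words with the same number of segments, which is what the sample neighbor lists show; the function names are hypothetical.

```python
from itertools import combinations

def one_apart(a, b):
    """True if two segment tuples differ by exactly one substituted segment."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def clustering(target, lexicon):
    """Return (coefficient, Nv, Kv) for a target word within a lexicon."""
    neighbors = [w for w in lexicon if one_apart(w, target)]
    kv = len(neighbors)
    # Nv counts neighbor pairs that are also one segment apart.
    nv = sum(one_apart(a, b) for a, b in combinations(neighbors, 2))
    coeff = 2 * nv / (kv * (kv - 1)) if kv > 1 else 0.0
    return coeff, nv, kv

# The target ca: and its nine listed neighbors, as space-separated segments.
forms = ["c a:", "c o:", "j a:", "k a:", "m a:", "n a:", "r a:", "t a:", "th a:", "mph a:"]
lexicon = [tuple(f.split()) for f in forms]

coeff, nv, kv = clustering(("c", "a:"), lexicon)
print(f"{coeff:.4f}", nv, kv)  # 0.7778 28 9
```

Here the eight neighbors ending in a: are all mutual neighbors (28 pairs), while co: links to none of them, giving 2 x 28 / (9 x 8) = 0.7778 as in the sample row.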

#File: crcl/AA/alk/huffman1971vocabulary.c12/load.tsv

#huffman1971vocabulary  12  AA  alk  Guibian Zhuang
#segment  contrst  total  load
a         35       378    0.0925
a:        29       101    0.2871
b         11       6      1.8333



#File: crcl/AA/alk/info/Ethnologue/geo_distance.tsv

#ISO  analysis  0-100  101-200  201-300  301-400  401-500
alk  Ethnologue
llo:10|oyb:15|irr:17|ngt:26|spu:32|lbo:41|skk:49|tto:49|nev:51|kuf:54|tth:64|tgr:74|kgd:80|oog:80|jeg:83|kgc:85|sqq:97
stg:103|pac:107|hid:112|brb:113|phg:114|ktv:116|brv:121|tdf:121|jeh:137|krv:151|hal:151|tkz:154|bru:158|rmx:163|ren:169|sed:174|xkk:192|tdr:195|kta:196|cua:199
kxy:209|xhv:216|moo:217|krr:221|tpu:228|sss:237|hre:240|jra:242|yoy:245|nuo:252|nyl:255|scb:256|bdq:261|pcb:267|skb:287|kdt:295|pkt:296
rka:305|uan:312|aem:312|nyw:314|cmo:321|pht:327|vie:340|rad:343|thm:343|bgl:354|kxm:362|hro:365|tmp:369|bfk:384|lso:385|tts:390|tpo:393
mng:406|khm:407|mnn:407|hnu:412|cja:417|sti:418|tnu:419|huq:422|stt:427|tyj:437|tou:439|rog:443|cma:455|kpm:463|cuq:469|lic:474|cje:479|syo:483|jio:483|tyh:488|tmm:489|the:492|lao:495

Each row is labeled with the current ISO 639-3 code and
the source (Ethnologue or Glottolog) of the language
position points. The remainder of the row has five
tab-separated groups of ISO:distance pairs, each | separated.
Distances are in kilometers, using single-point language
locations; these are progressively less meaningful as the
size of the speaker community increases. A future
release will attempt to take national or regional
languages into account, regardless of their point
distance. ISO codes are only given for languages we
have data for.
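
The following sketch shows one way a consumer might read a geo_distance.tsv row, and how point-to-point distances of this kind could be computed and binned. It is an illustration under stated assumptions: the report does not say which distance formula was used, so the haversine great-circle formula and the 6371 km Earth radius are assumptions, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat,long points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def band_index(km):
    """Index 0..4 of the band (0-100, 101-200, ...), or None beyond 500 km."""
    km = round(km)
    if km > 500:
        return None
    return 0 if km <= 100 else (km - 1) // 100

def parse_row(line):
    """Split one data row into (ISO code, analysis, list of {ISO: km} per band)."""
    iso, analysis, *groups = line.rstrip("\n").split("\t")
    bands = [
        {code: int(km) for code, km in (pair.split(":") for pair in g.split("|") if pair)}
        for g in groups
    ]
    return iso, analysis, bands

# A shortened, hypothetical row; empty bands are allowed.
iso, analysis, bands = parse_row("alk\tEthnologue\tllo:10|oyb:15\tstg:103\t\t\tmng:406")
print(iso, analysis, [sorted(b) for b in bands])
```

Because the distances come from single points, the band a language falls in should be read as a rough indication only, as the text above notes for large speaker communities.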


#File: crcl/AA/alk/info/Ethnologue/ety_distance.tsv

alk  Ethnologue
alk|tpu
alk|brb|bru|brv|cbn|cog|irr|jeh|kdt|kgc|kgd|khm|kjg|ktv|kuf|lbo|lep|mlf|mnw|ngt|oog|pac|pcb|sss|sti|tdf|tpu|tth|tto
alk|brb|bru|brv|cbn|cog|irr|jeh|kdt|kgc|kgd|khm|kjg|ktv|kuf|lbo|lep|mlf|mnw|ngt|oog|pac|pcb|sss|sti|tdf|tpu|tth|tto

Each line has three tab-separated groups of ISO codes
that share the same parent, grandparent, and
great-grandparent; note that some groups may be identical.

Ethnologue data is (for the moment) suspect due to
problems in properly identifying some parent levels.

Only groups with <=50 ISO codes are reported, because
a single early-branching survivor (common in
Austronesian) may include the entire family as first
cousins. ISO codes are only given for languages we
have data for. The Glottolog analysis tends to have
more branches / smaller groups.

Each ASJP line lists ISO codes and the NDLD
(normalized Levenshtein distance divided) from the
current ISO code [Bakker et al 2009]. A maximum of
50 codes are provided. In some cases a distance from
the current ISO code (to itself) may be reported; this
occurs when the ASJP dataset had multiple lect samples.
#File: crcl/AA/alk/info/Ethnologue/ety_lexicon.tsv

#sources  gloss     [alk] huffman1971vocabulary 12  [alk] theraphan2001languages_2 4  [tpu] huffman1971vocabulary 16  ...
2         !dau#k#1  c.a.w                                                            k.i.m.n.a.n k.a.n
3         !fat#k#1  t.a:                            t.a:                             ph.i.0 n 0.j
3         !mot#k#1  j.a.1                           j.a.1?                           m.a.e.? n a.j
2         !son#k#1  c.a.w                           p.a.s.a:.w

A lexicon of sister languages according to a specific
subgroup analysis. Trees for Ethnologue and Glottolog
are similar. Each row is labeled with the number of lects
that have forms for the gloss in the second column. All
entries in each sister-language lexicon are included;
however, some rows may have only a single form entry.

#File: crcl/AA/alk/info/Ethnologue/geo_lexicon.tsv

#sources  gloss  [AA:alk:0] huffman1971vocabulary 12  [AA:alk:0] theraphan2001languages_2 4  [AA:irr:17] huffman1979vocabulary 1  ...
2  !understand#v#1     c.o:.m
3  Careya_arborea#n#0  k.a.d.o:.n
3  Caryota#n#1         t.a.j.ui.g
4  Hypericacaea#n#1    h.a.q.i.a.g  h.a.q.i.a.g

A lexicon of lects whose point locations are within
100 km of each other. Trees for Ethnologue and
Glottolog are similar. NB: this is a wide table; the data
values shown here are for illustrative purposes only.


#File: crcl/AN/mak/text/Scannell/mak-Latn/info.txt

ISO 639-3  BCP-47    glottocode  name     country
mak        mak-Latn  maka1311    Makasar  Indonesia (Sulawesi)

The Scannell info file summarizes the per-BCP-47 code
information he provides.


#File: crcl/AN/mak/text/Scannell/mak-Latn/urls.txt
http://incubator.wikimedia.org/wiki/Wp/mak/Gowa
http://incubator.wikimedia.org/wiki/Wp/mak/Main_Page
http://incubator.wikimedia.org/wiki/Wp/mak/Persigowa_Gowa
http://incubator.wikimedia.org/wiki/Wp/mak/PSM_Mangkasara%27
http://www.bible.is/toc?version=MAKLAI&language=Makassar

Scannell source file; shows links to his data sources.


#File: crcl/AN/mak/text/Scannell/mak-Latn/chartrigrams.txt
ang  25834
ng>  15144
ri>  15101
na>  13937
<an  12413

Scannell source file; contains character triples and
counts. < and > indicate word boundaries.


#File: crcl/AN/mak/text/Scannell/mak-Latn/wordbigrams.txt
. \n  10623
mae ri  2290
, "  1820
. "  1021
" \n  1015

Scannell source file; these are space-separated token
bigrams and counts.


#File: crcl/AN/mak/text/Scannell/mak-Latn/words.txt
ri  11702
anjo  5001
siagang  3687
ke'nanga  3110
Allata'ala  3031

Scannell source file; these are space-separated tokens
and counts.


#File: crcl/cognates/cognates.xml
<document version="1.0">

<cognate id="huffman1971vocabulary:C:c13.r625.gs2041.i8527" iso639-3="lbo" lang="Laven">
  <etygloss>roast#v#1</etygloss>
  <cogset>AA:S2041</cogset>
  <form>buh</form>
</cognate>

<cognate id="huffman1971vocabulary:C:c9.r625.gs2041.i8526" iso639-3="kdt" lang="Kuy">
<etygloss>roast#v#1</etygloss> <cogset>AA:S2041</cogset> <form>buh</form></cognate>

<cognate id="huffman1979vocabulary:C:c10.p19-29.r1471.i11948" iso639-3="sss" lang="So">
<etygloss>roast#v#1</etygloss> <cogset>AA:S2041</cogset> <form>buh</form></cognate>
<cognate id="huffman1979vocabulary:C:c11.p21-29.r1474.i14776" iso639-3="tto" lang="Lower Ta'oih">
<etygloss>roast#v#1</etygloss> <cogset>AA:S2041</cogset> <form>boh</form></cognate>

The complete set of cognate entries. The cognate tag
encapsulates each entry, with attributes id (consistent
across all data), an iso639-3 code, and the formal ISO
lang language name. The etygloss gives a rough
historical semantic label; each cogset numbers a cognate
set. The form (like the attributes) is included for
convenience, and can be recaptured from the main
dataset. NB: The first entry has been indented for display.
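
A file in this shape can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` to group forms by etygloss and cogset. It is illustrative only: the embedded sample is a best-effort reading of two printed entries, the closing `</document>` tag is assumed (the printed excerpt stops before it), and `group_by_cogset` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Two entries transcribed from the printed sample; </document> is assumed.
SAMPLE = """<document version="1.0">
<cognate id="huffman1971vocabulary:C:c13.r625.gs2041.i8527" iso639-3="lbo" lang="Laven">
  <etygloss>roast#v#1</etygloss>
  <cogset>AA:S2041</cogset>
  <form>buh</form>
</cognate>
<cognate id="huffman1979vocabulary:C:c11.p21-29.r1474.i14776" iso639-3="tto" lang="Lower Ta'oih">
  <etygloss>roast#v#1</etygloss>
  <cogset>AA:S2041</cogset>
  <form>boh</form>
</cognate>
</document>"""

def group_by_cogset(xml_text):
    """Map (etygloss, cogset) -> list of (ISO 639-3 code, form)."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for cognate in root.iter("cognate"):
        key = (cognate.findtext("etygloss"), cognate.findtext("cogset"))
        groups[key].append((cognate.get("iso639-3"), cognate.findtext("form")))
    return dict(groups)

groups = group_by_cogset(SAMPLE)
for (etygloss, cogset), members in groups.items():
    print(etygloss, cogset, members)
```

For the full cognates.xml, `ET.iterparse` would avoid loading the whole document into memory; the grouping logic stays the same.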


#File: crcl/cognates/etygloss/able#a#1/AA:S1179.xml
<document version="1.0">

<cognate id="huffman1971vocabulary:C:c1.r39.gs1179.i641" iso639-3="khm" lang="Central Khmer">
  <etygloss>able#a#1</etygloss>
  <cogset>AA:S1179</cogset>
  <form>ba:n</form>
</cognate>

Identical to the same item in the complete cognate set,
above.
Appendix E: Languages of Disaster


CRCL  proposes  to  build  a  resource  that  locates,  enriches,  and  ties  language  and  GIS  data  to 
humanitarian  assistance  /  disaster  relief  (HA/DR)  event  histories.  It  will  support  applications  for 
responding  to,  and  predicting  or  pre-provisioning,  disaster  events.  We  attempt  to  balance  today’s 
desire  for  an  interactive  sandbox  with  tomorrow’s  probable  request  for  machine  access  to  data 
for  re-use  and/or  re-implementation.  This  document  describes  the  project’s  goals,  content,  and 
development  issues.  An  initial  proof  of  concept  can  be  found  at 

http://sealang2.net/project/lorelei/over.

Introduction 

The  DARPA  LORELEI  project  is  based  on  the  observation  that  language  information  is  integral 
to  effectively  detecting,  directing,  and  delivering  HA/DR  assistance.  Most  of  the  current 
research  effort  frames  the  issue  from  the  point  of  view  of  response  that  involves  given  target 
languages:  how  can  we  most  effectively  analyze  communications  in  a  particular  language  in  a 
disaster  situation? 

We  extend  this  by  considering  the  issue  from  the  point  of  view  of  both  response  to  and 
anticipation  of  events.  Given  an  impending  disaster,  what  geographic  areas  and  speaker 
communities  will  be  affected?  Given  a  history  of  disaster  characteristics  (frequency,  duration, 
extent,  impact),  as  well  as  understanding  of  language  distributions  and  relations,  can  we  predict 
not only what areas and communities a particular kind of disaster will affect, but also what
languages  might  be  most  usefully  pre-provisioned?  This  information  is  helpful  to  both  users  and 
providers  of  LORELEI  capability. 

We  also  consider  the  problem  from  the  distinct  viewpoints  of  LORELEI  1.*  performers,  and  the 
analysts who are our ultimate downstream consumers. For example, the performer wants to
know the likely source of loan words into a target language; the analyst wants to know what
language(s)  a  random  person  in  an  arbitrary  city  is  likely  to  speak.  The  performer  wants  an 
aggregate  model  that  helps  in  machine-based  language  identification,  while  the  analyst  needs  to 
know  likely  forms  for  “hungry”  within  a  10-mile  radius. 

A  secondary  goal  of  the  project  is  to  make  the  somewhat  inchoate  mass  of  language-relevant 
information  more  discoverable  and  comprehensible.  We  want  to  be  able  to  instantly  answer  such 
questions  as:  is  MT  technology  available  for  a  given  language?  What  is  the  most  similar 
language  that  has  either  MT,  or  substantial  data  resources?  Are  text  samples  available?  If  not, 
what  wider  language  of  communication  is  likely  to  have  influenced  a  given  language’s  writing 
system?  What  related  and  unrelated  languages  inhabit  the  same  general  geographic  area,  and 
what  are  their  relative  speaker  numbers? 

Considerable  work  has  been  done  on  each  of  this  problem’s  three  major  aspects:  linguistics, 
geodata,  and  HA/DR  data.  Unfortunately,  we  cannot  produce  a  useful  tool  simply  by  mashing 
datasets  together;  there  are  non-trivial  problems  to  solve  in  both  harmonizing  and  extracting 
actionable  information  from  the  data.  By  the  same  token,  even  given  harmonized,  mashable 
data,  it  is  not  instantly  clear  what  the  most  effective  ways  to  articulate  queries  and  display  results 
should  be.  We  built  the  proof-of-concept  website  to  explore  this  question. 
Design principles

Our  first  premise  is  that  any  one  or  more  of  three  basic  parameters  -  languages,  HA/DR  events, 
and  geographic  areas  -  should  be  able  to  serve  as  a  search  key  for  any  of  the  others.  This  is 
achieved  by  indexing  each  data  set  in  terms  of  one  or  more  ADM-1  top-level  administrative 
areas, typically provinces or states. Thus, the ADM-1 is the common key to all data (see footnote 2). Implicitly,
features of any one set link to the others via the ADM-1; e.g. a date range implicitly links to
languages affected by HA/DR events that fall in that range and affect those speaker areas.
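
To make the common-key idea concrete, the sketch below indexes languages by ADM-1 and then answers a date-range query by passing through the ADM-1 areas of matching events. It is a toy illustration, not the website's implementation: the event records, their identifiers and years are invented for the example, and the language-to-ADM-1 pairs reuse the three Thai sample rows from the info.geo listing in Appendix D.

```python
from collections import defaultdict

def index_by_adm1(records):
    """Map each ADM-1 name to the set of keys (e.g. ISO codes) located there."""
    index = defaultdict(set)
    for key, adm1_names in records:
        for adm1 in adm1_names:
            index[adm1].add(key)
    return index

# Language -> ADM-1, following the sample info.geo rows.
languages = [
    ("tha", ["Changwat Lop Buri"]),
    ("tts", ["Changwat Maha Sarakham"]),
    ("nod", ["Changwat Lampang"]),
]
# Hypothetical events; ids, years, and areas are invented.
events = [
    {"id": "flood-A", "year": 2011, "adm1": ["Changwat Lop Buri", "Changwat Lampang"]},
    {"id": "storm-B", "year": 2015, "adm1": ["Changwat Maha Sarakham"]},
]

lang_index = index_by_adm1(languages)

def languages_in_date_range(first_year, last_year):
    """Languages whose ADM-1 was touched by any event in the year range."""
    hit = set()
    for event in events:
        if first_year <= event["year"] <= last_year:
            for adm1 in event["adm1"]:
                hit |= lang_index.get(adm1, set())
    return hit

print(sorted(languages_in_date_range(2010, 2012)))
```

The same index can be built for events keyed by ADM-1, so that a language or a region can serve as the starting point of the query instead of a date range.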

Second,  we  want  to  be  able  to  aggregate  results  whenever  possible.  A  data-driven  choice  of  a 
high-value  investment  language  -  that  is,  a  language  that  should  arguably  be  pre-provisioned  - 
depends  on  understanding  not  only  its  similarity  to  related  languages  and  the  availability  of 
existing  resources,  but  also  the  expected  impact  of  future  disasters  on  speaker  communities 
which  might  benefit.  We  cannot  assume  that  well-provisioned  national  languages  (such  as  Thai 
or  Vietnamese)  will  fill  this  role,  incidentally  -  their  very  success  (and  the  integration  of  foreign 
influences  this  usually  implies)  often  makes  these  languages  poor  examples  of  the  family  or 
branch  as  a  whole. 

Third,  we  try  to  anticipate  and  enable  any  logical  needs  for  follow-through  /  drill  down  /  loop 
back.  For  example,  a  query  into  events  that  have  affected  a  region  will  return  mentions  of 
countries  and  languages.  The  natural  drill-down  is  to  click  on  one  of  these  countries  or 
languages  in  order  to  see  what  events  have  impacted  it.  Then,  we’re  likely  to  want  to  loop  back 
-  click  on  an  event  to  re-use  it  as  a  new  starting  query  -  because  it  delimits  a  region  or  set  of 
languages. 

Fourth,  we’re  interested  in  what  might  be  called  analytical  imagery.  The  demo  site  shows  some 
simple  examples  of  how  weighting  can  be  used  to  render  maps  that  may  help  clarify  unseen 
relations,  such  as  the  contrast  between  the  number  of  events,  and  their  impact  in  terms  of 
population  and  speaker  community  numbers. 

Finally, we want to expose data, and not just analyzed results, in support of decision making.
Our  goal  is  not  to  replace  the  analyst,  but  rather  to  provide  all  available  information,  allowing 
alternative  views  of  single  data  sets,  and  comparison  of  alternative  data  sets.  We  also  want  to 
allow  drill-down  into  any  of  the  language  /  event  /  geographic  axes,  e.g.  a  visual  interface  may  be 
useful  for  discovery,  but  a  lexical  dataset,  list  of  languages  by  city,  or  contemporaneous  news 
reports  might  ultimately  be  most  useful  to  the  analyst.  Thus,  we  anticipate  providing  machine 
access  to  data. 

Data  sources  are  discussed  in  more  detail  below,  but  briefly: 

•  disaster  data  is  taken  from  the  EM-DAT  and  GLIDE  datasets, 

•  GIS  data  is  from  the  GADM  shapefile  sets,  which  attempt  to  cover  all  five  ADM  levels 
worldwide,  and  GeoNames.org,  which  has  the  best  vernacular  and  informal  name 
information, 

•  linguistic  data  is  from  a  variety  of  sources:  subgrouping  from  Ethnologue,  Glottolog,  and 
ASJP,  MT  availability  from  our  own  survey  of  Google,  Bing,  and  Yandex  resources,  base- 
level  resource  availability  from  GlottoDoc,  corpus  availability  from  An  Crubadan, 

•  secondary  data  is  inferred  whenever  possible. 


2  This  also  turns  out  to  be  an  effective  granularity  from  the  linguistic  perspective  -  ADM-1  boundaries  are  not 
necessarily  arbitrary  political  boundaries;  rather  they  often  delimit  geographic,  ethnic,  and  linguistic  areas. 
The current implementation takes a few shortcuts. For example, the GLIDE data is only roughly
integrated  (it  will  ultimately  be  tied  to  EM-DAT,  which  has  much  better  geographic  extent  data). 
And  we  use  each  language’s  nominal  center  point  to  identify  a  single  ADM-1  entity  (in  fact,  it 
might  be  spoken  in  several).  We  can  sometimes  mitigate  these;  for  example,  GLIDE  data  can  be 
roughly  aligned  by  incident  date,  and  a  country’s  national  language(s)  can  be  assigned  to  every 
ADM-1. 

Functionality  and  use  cases 

To varying degrees, CRCL's /over website provides the following types of information and
functionality: 

•  resource  availability  for  all  7,100  living  languages  per  the  ISO  639-3  standard, 

•  resource  availability  within  LORELEI, 

•  impact  of  disasters  on  speaker  communities, 

•  the  likely  national  and  regional  second/third  languages  for  each  speaker  community, 

• the 'nearest' (per Ethnologue/Glottolog) relative that has tools or large data resources,

• the condition and reliability of state-of-the-art disaster and speaker data,

•  various  maps  that  show  weighted  event  distributions, 

•  a  summary  of  event  types  and  languages  affected, 

•  analysis  of  likely  high-value  investment  language  candidates. 

Typical use cases for 1.* LORELEI developers include identification of:

• suitable incident languages, which have an appropriate mix of population and existing
resources,

•  high-value  investment  languages  -  those  that  are  directly  or  indirectly  the  target,  fallback, 
or  pivot  for  -  high-risk  regions, 

•  languages,  regions,  and  dates  of  known  past  events,  which  may  be  used  to  help  model  and 
recognize  on-line  “disaster  chatter.” 

Finally, from the analyst's perspective, we can explore:

•  impact  of  past  events  of  the  same  type  in  the  same  area, 

•  speaker  communities  likely  to  be  affected,  and  their  populations, 

•  languages  likely  to  be  used/understood  in  each  city  in  the  area, 

•  language  resources  available, 

•  most  likely  broad  language(s)  of  communication, 

•  external  reports  linked  to  the  EM-DAT  or  GLIDE  identifiers  (not  yet  implemented), 

•  a  set  of  HA/DR  query  terms  for  each  language  (e.g.  CRCL’s  HA/DR  parallel  lexicon  sets), 

• ideally, a model of historical "disaster chatter," especially in languages that are not currently
modeled or discoverable (we like the HA/DR lexicon, but are not convinced that it can
properly seed or identify all relevant online data).

Disaster  Resources 

The  primary  disaster  resources  are  EM-DAT  and  GLIDE.  Both  provide  numbers  that  identify 
event  type  and  date.  A  separate  number  is  issued  for  each  country;  i.e.  a  single  event  may  have 
multiple  numbers. 
GLIDE The Global Identifier Number system was developed by the Asian Disaster Reduction
Center  (ADRC).  It  includes  6,259  event  references  (http://glidenumber.net).  GLIDE  supplies 
somewhat  longer  text  descriptions  of  the  events,  as  well  as  a  single  latitude  /  longitude  point 
(derivation  is  unclear). 

EM-DAT The Emergency Events Database, produced by the Centre for Research on the
Epidemiology  of  Disasters  (CRED).  “EM-DAT  contains  essential  core  data  on  the  occurrence 
and  effects  of  over  22,000  mass  disasters  in  the  world  from  1900  to  the  present  day.  The 
database  is  compiled  from  various  sources,  including  UN  agencies,  non-governmental 
organizations,  insurance  companies,  research  institutes  and  press  agencies.”  (http://emdat.be) 

EM-DAT  supplies  a  text  description  of  each  numbered  event’s  area.  In  theory  this  is  an 
administrative  area  as  specified  by  GAUL  (discussed  below),  but  in  practice  locations  are  given 
as  a  mix  of  formal  and  informal  names.  In  some  cases,  EM-DAT  also  provides  estimates  of  the 
financial  impact,  number  of  deaths,  and  number  of  people  affected  by  each  event. 

Both  GLIDE  and  EM-DAT  numbers  are  sometimes  cited  in  other  databases.  However,  regular 
citation  (a  la  ISBN  numbers)  is  not  common. 

Shortcomings  GLIDE  and  EM-DAT  are  not  cross-linked.  Because  they  do  not  always  record 
events  as  occurring  in  the  same  time  or  place,  they  will  require  a  combination  of  machine  and 
hand alignment. While GLIDE's lat/long points are helpful for obtaining a quick visual
overview of events in a region, they give no indication of the actual extent of any event.
EM-DAT does a much better job of listing affected areas; however, the public dataset does not
normalize  these  names  to  GAUL  ADM-1  names.  Again,  we  can  do  quite  a  bit  of  heavy  lifting 
by  machine,  but  hand  alignment  will  also  be  required. 

As  noted,  the  EM-DAT  impact  estimates  are  incomplete.  We  will  provide  parameters  for 
estimating  the  blanks  by  using  known  relations  between  cost/death/affected  figures,  and  between 
known  impacts  and  event  types. 

Geo  Data  Resources 

Primary  resources  are  listed  here.  We  rely  on  GADM,  with  additional  support  from  GeoNames. 

GADM  The  Global  Administrative  Areas  project  provides  shapefiles  for  all  five  ADM  levels  for 
all  countries.  It  currently  has  data  for  294,430  administrative  areas.  This  is  the  best  open 
shapefile source, and has reliable ADM identification (http://gadm.org).

GeoNames  This  is  the  most  extensive  set  of  place  names  and  equivalents  available.  “The 
GeoNames  geographical  database  ...  contains  over  10  million  geographical  names  and  consists  of 
over  9  million  unique  features  whereof  2.8  million  populated  places  and  5.5  million  alternate 
names.  All  features  are  categorized  into  one  out  of  nine  feature  classes  and  further 
subcategorized into one out of 645 feature codes.” (http://geonames.org/about.html). Lat/long
points  are  provided  for  each  item. 

GNS  The  Geographic  Names  System  (US  National  Geospatial-Intelligence  Agency)  set  is  the 
US  standard.  It  only  includes  point  information  for  ADM-1  entities. 
(http://geonames.nga.mil/gns/html/) 

GAUL  The  Global  Administrative  Unit  Layers  dataset  is  prepared  by  the  United  Nations  /  FAO. 
It includes shapefiles for ADM-1 and ADM-2 entities. It is not publicly available; however,
there is a released crosswalk to GNS. See
http://www.fao.org/geonetwork/srv/en/metadata.show?id=12691 and
http://blog.gdeltproject.org/global-second-order-administrative-divisions-now-available-from-gaul/.

Shortcomings  The  resources  above  reflect  the  distinctive  primary  concerns  of  their  developers, 
and  it  is  probably  better  to  think  in  terms  of  each  set’s  strengths  rather  than  its  weaknesses. 
GeoNames is extremely helpful for identifying non-standard and vernacular names, but names
may be missing, or under-specified (in terms of ADM category). GADM has excellent coverage
of formal names, and has both points and polygons, but is not sufficient for identifying place
names found in the wild.

As  noted  above,  because  place  naming  in  EM-DAT  is  somewhat  irregular,  normalizing  its 
combination  of  (usually)  ADM-1  and  ADM-2  names  to  GADM  will  require  a  combination  of 
machine  and  hand  work. 
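
A minimal sketch of the machine side of this normalization: strip accents, case, and generic words such as "State" or "Region", then fuzzy-match what remains against the formal ADM-1 names, returning nothing when no candidate is close enough so that the name falls through to hand work. The target list reuses Myanmar ADM-1 names shown in the screen captures later in this report; the raw input names, the list of generic words, and the 0.7 cutoff are assumptions chosen for illustration.

```python
import difflib
import unicodedata

GENERIC_WORDS = {"state", "region", "province", "division"}  # assumed list


def normalize(name):
    """Lowercase, strip accents, and drop generic administrative words."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch)).lower()
    words = [w for w in text.replace("-", " ").split() if w not in GENERIC_WORDS]
    return " ".join(words)


def match_place(raw, formal_names, cutoff=0.7):
    """Return the best formal name for a raw name, or None for hand review."""
    index = {normalize(n): n for n in formal_names}
    hits = difflib.get_close_matches(normalize(raw), list(index), n=1, cutoff=cutoff)
    return index[hits[0]] if hits else None


adm1_names = ["Magway Region", "Shan State", "Rakhine State",
              "Kachin State", "Kayin State", "Bago Region"]

for raw in ["Rakhine", "Magwe", "Irrawaddy delta"]:
    print(raw, "->", match_place(raw, adm1_names))
```

A lower cutoff recovers more spelling variants by machine but raises the risk of false matches, which is one reason a hand-checked pass remains necessary.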

Language  Data  Resources 

Subgrouping  To  determine  language  similarity  globally  we  rely  on  Ethnologue,  Glottolog,  and 
ASJP  (the  Automated  Similarity  Judgment  Project).  All  use  ISO  639-3  codes  for  language 
indexing; however, Glottolog rejects some of these and maintains a parallel set (glottocode) of
finer-grained lect-by-lect identifiers. Ethnologue and Glottolog provide roughly the same family
and subgroup analyses; however, Glottolog tends to split (and Ethnologue tends to lump) lower-level
sub-branches. ASJP does not provide a branch analysis per se; rather, one can build a table
of  distance  measures  for  all  languages. 
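
As an illustration of building such a table from short word lists, the sketch below averages a length-normalized Levenshtein distance over the meanings two lists share. This is a common choice for this kind of comparison and is not claimed to be ASJP's exact measure. The language labels and word forms are invented toy data.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer word (0.0 to 1.0)."""
    return levenshtein(a, b) / max(len(a), len(b))


def language_distance(words_a, words_b):
    """Mean normalized distance over the meanings present in both lists."""
    shared = sorted(set(words_a) & set(words_b))
    return sum(normalized_distance(words_a[m], words_b[m]) for m in shared) / len(shared)


def distance_table(wordlists):
    """Map each unordered language pair to its distance."""
    return {
        (x, y): language_distance(wordlists[x], wordlists[y])
        for x, y in combinations(sorted(wordlists), 2)
    }


# Invented forms: lang_a and lang_b are near-identical, lang_c is unrelated.
wordlists = {
    "lang_a": {"water": "tako", "fire": "miru", "stone": "sapa"},
    "lang_b": {"water": "tako", "fire": "milu", "stone": "saba"},
    "lang_c": {"water": "wen", "fire": "hoto", "stone": "kil"},
}
table = distance_table(wordlists)
for pair, dist in sorted(table.items(), key=lambda item: item[1]):
    print(pair, round(dist, 3))
```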

This  is  an  area  in  which  data  and  methodology  from  CRCL  and  other  LORELEI  performers 
should  be  able  to  make  a  significant  improvement.  The  Ethnologue  and  Glottolog  analyses  are 
based  on  (sometimes  idiosyncratic)  interpretations  of  what  constitutes  a  significant  phonological 
innovation;  this  does  not  always  speak  to  similarity  from  the  point  of  view  of  machine  translation 
or  language  identification.  ASJP  uses  tiny  (40-item)  sets;  these  may  distinguish  the  major  family 
and  branch  splits,  but  are  less  effective  at  finer  levels. 

Machine  Translation  We  list  the  open  access  tools  provided  by  Google,  Bing,  and  Yandex, 
including  development  languages,  as  a  proxy  for  the  availability  of  “advanced”  language 
technology  resources.  These  languages  probably  have  other  necessary  resources  (text  and  bitext 
corpora,  dictionaries)  available. 

Text  Corpora  As  noted  above,  available  MT  resources  usually  predict  corpus  availability  for 
the  major  languages.  For  the  other  98%,  Scannell’s  An  Crubadan  is  believed  to  be  the  broadest 
corpus  set  known. 

Demographic  Data  We  license  the  Ethnologue  18  dataset.  This  provides  speaker  number 
approximations,  and  details  regarding  each  language’s  official  status  (which  is  helpful  for 
inferring  secondary  languages  of  communication).  Speaker  area  data  is  based  on  Ethnologue; 
see  e.g.  http://langscape.umd.edu/map.php.  We  will  not  distribute  any  shapefile  data. 
Proof-of-concept

An  initial  proof  of  concept  can  be  found  at  http://sealang2.net/project/lorelei/over.  It 
demonstrates  most  of  this  proposal’s  ideas,  but  still  requires  work  in  various  areas: 

•  documenting  website  functionality, 

•  aligning  the  GLIDE  and  EM-DAT  event  numbers, 

•  revising  the  EM-DAT  location  data  to  reflect  precise  ADM  entities, 

•  obtaining city and ADM-1-level data on language distribution,

•  parameterizing  measures  that  estimate  missing  death,  damage,  and  affected  population 
figures, 

•  improving  the  current  quick-and-dirty  similarity  measures  used  to  identify  pivot  languages, 

•  adding  mouse  functionality  to  the  map  displays, 

•  providing  additional  functionality  for  summarizing  historical  events  by  language(s)  and 
vice  versa, 

•  linking  CRCL’s  “small  lexicons,”  and  very  large  set  of  HA/DR  parallel  lexicons,  to  the 
interface, 

•  identifying  and  providing  click-through  access  to  other  external  data  sources  that  are 
accessible  via  EM-DAT  and/or  GLIDE  numbers. 


Annotated  screen  captures  follow,  below. 
Complete browser: menu, center top, center bottom, maps

[Screen capture, menu frame: the query menu of the CRCL ISO 639-3 / Resource / HA/DR Event Overview. A legible capture of this menu appears below, under "Query specification".]

About the CRCL ISO 639-3 / Resource / HA/DR Event Overview

This is an internal CRCL web page. Please do not link to or distribute this URL. All four frames may be resized.

Click Eastern Africa and search, left, to begin. Click HA/DR Event Overview, upper left, to reload this frame.

NB: This is a full-scale proof-of-concept. Please expect occasional turbulence from noisy source data, and temporarily tolerate (but tell me about) any bugs.

This tool requires a large screen (or better yet, two screens). It provides rough pictures of:

•  resource availability for all 7,100 living languages per the ISO 639-3 standard,
•  resource availability within LORELEI,
•  impact of disasters on speaker communities,
•  the likely national and regional second/third languages for each speaker community,
•  the nearest (per Ethnologue / Glottolog) relative that has tools or large data resources,
•  the condition and reliability of state-of-the-art disaster and speaker data,
•  various maps that show weighted event distributions,
•  analysis of likely high-value investment language candidates.

It will support accurate identification of:

•  suitable incident languages, which have an appropriate mix of population and existing resources,
•  high-value investment languages: those that are directly or indirectly the target, fallback, or pivot for high-risk regions,
•  languages, regions, and dates of known past events, which may be used to help model and recognize on-line 'disaster chatter'.

This beta release matches top-level administrative areas (as inferred from point data for languages and some formal descriptive lists for disasters) to align speaker communities and HA/DR events.

A future release will use actual speaker areas and pre-normalize the HA/DR event data (apparently beyond the level reported in Cred Crunch 43), and will be considerably more accurate.


Languages of Disaster

After completing a search, click on any language or country in the table above to see associated events.

Sources

EM-DAT disaster event data is from emdat.be (downloaded 9-26-2016). All regions and non-technological events are included for the period 1990 - 2016 (their last update was July 19 2016).
D. Guha-Sapir, R. Below, Ph. Hoyois - EM-DAT: The CRED/OFDA International Disaster Database - www.emdat.be - Université Catholique de Louvain - Brussels - Belgium.
Ethnologue 18 data is licensed from the Ethnologue Global Dataset, Eighteenth edition data. M. Paul Lewis, Gary F. Simons, and Charles D. Fennig, Editors.
Glottolog 2.6 data is taken from Hammarström, Harald & Forkel, Robert & Haspelmath, Martin & Bank, Sebastian. 2016. Glottolog 2.6. Leipzig: Max Planck Institute for Evolutionary Anthropology. http://glottolog.org [access date illegible]. CC-BY-SA 3.0.
GeoNames data queried from GeoNames.org, 2015-12-13. CC-BY 3.0.
GLIDE data queried from glidenumber.net (downloaded 9-28-2016 - 10-4-2016).
CRCL data from sealang2.net/project/lorelei/down.


Languages of Disaster

Maps will appear here after every search. Map height and the width of this frame can be adjusted at upper left (all frames can also be dragged to resize).

For the moment maps can be zoomed or panned, but there are no hover or click events. These heatmaps should render almost instantaneously, btw - they are much faster than the sortable tables in the center frames.

Please let me know if you have any ideas for additional 'analytical' mapping, i.e. not simply mapping for position, but using weighting to help foreground less-than-obvious relations.

Below, language counts and national populations from Ethnologue 18. Countries that do not have 'native' languages do not show language or population counts. ADM-1 counts are inferred from GADM 2.8.



[Screen capture, map frame: alphabetical country list with population figures. A legible, re-sorted capture of this list appears below.]


Above,  the  initial  site  view.  The  capture  below  was  taken  after  a  single  query,  selecting  only  “Myanmar”  in  the  menu  on  the  left. 
Note  that  the  center  frame  has  separate  top  and  bottom  portions  -  the  top  contains  a  sortable  table,  and  the  bottom  has  a  fixed  table, 
followed  by  a  sortable  table. 


[Screen capture: complete browser after the "Myanmar" query. The menu, center top, and center bottom frames are reproduced individually below.]


Below,  language  counts  and  national  populations  from 
Ethnologue  18.  Countries  that  do  not  have  "native"  languages 
do  not  show  language  or  population  counts.  ADM-1  counts 
are  inferred  from  GADM  2.8. 


Abb | Country       | ISO | ADM-1 | Population
CHN | China         | 252 | 31    | 1,357,380,000
IND | India         | 381 | 37    | 1,252,140,000
USA | United States | 181 | 52    | 313,914,000
IDN | Indonesia     | 684 | 34    | 248,818,000
BRA | Brazil        | 178 | 27    | 193,947,000
PAK | Pakistan      | 56  | 8     | 184,350,000
NGA | Nigeria       | 484 | 37    | 172,713,000
BGD | Bangladesh    | 15  | 7     | 156,591,000
RUS | Russia        | 91  | 85    | 143,856,000
JPN | Japan         | 14  | 47    | 127,339,000
MEX | Mexico        | 274 | 32    | 118,395,000
PHL | Philippines   | 175 | 81    | 98,394,000
ETH | Ethiopia      | 78  | 11    | 94,101,000
VNM | Vietnam       | 75  | 65    | 89,709,000

Prior to the query, the map area contained the list of country names, ISO 639-3 code counts,
ADM-1 top-level administrative entities, and populations seen above. This list has been
re-sorted by clicking on the Population cell. While there are obvious exceptions, ADM-1
areas have reasonably consistent granularity.


Query  specification.  The  upper  portion  focuses  on  event  types,  while  the  lower 
portion  allows  specification  of  geographic  areas  and/or  language  families  and 
branches  (large  portions  of  the  center  and  bottom  of  the  menu  captures  have  been 
snipped). 

These can be extended to refer to any aspect of the underlying data, and are
implemented as REST calls.
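
To make the idea concrete, the sketch below assembles one query specification into a GET request URL. This report does not document the actual calls, so the endpoint and every parameter name here are hypothetical placeholders; only the kinds of restriction (time period, deaths, country, family, single vs. cumulative events) are taken from the menu.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real service URL and routes are not given here.
BASE_URL = "https://example.org/overview/search"


def build_query(countries=(), families=(), year_from=1900, year_to=2016,
                min_deaths=None, max_deaths=None, cumulative=True):
    """Build a GET URL for one search; every parameter name is hypothetical."""
    params = [("from", year_from), ("to", year_to)]
    params += [("country", c) for c in countries]   # repeated key per country
    params += [("family", f) for f in families]     # repeated key per family
    if min_deaths is not None:
        params.append(("min_deaths", min_deaths))
    if max_deaths is not None:
        params.append(("max_deaths", max_deaths))
    params.append(("mode", "cumulative" if cumulative else "single"))
    return BASE_URL + "?" + urlencode(params)


url = build_query(countries=["Myanmar"], min_deaths=100)
print(url)
```

Keeping each restriction as an independent query parameter is what would let the menu grow to cover further aspects of the underlying data without changing existing calls.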


CRCL ISO 639-3 / Resource / HA/DR Event Overview

Find single EM-DAT #: Search / reset
Show: maps, table, types, investments, GLIDE
Map heights: 400px. Frame widths: (selector)
Time period: 1900 - 2016
Deaths (min..max): 0..n - 0..n
Affected (min..max): 0..n - 0..n
Apply restrictions to: single / cumulative events
Speakers (min..max): 0..n - 0..n
Show sisters: all / with resources only (are all disabled for now)
Show cousins: all / none / with resources / +resources/-sisters

Africa 2,138 | Americas 1,064 | Europe 286 | Asia-Pacific 3,613

Region and country (families below)

Africa

Eastern Africa (431): Burundi (1), Comoros (3), Eritrea (6), Ethiopia (78), Kenya (53), Madagascar (12), Malawi (10), Mauritius (2), Mayotte (2), Mozambique (30), Rwanda (1), Reunion (1), Seychelles (1), Somalia (8), South Sudan (49), Tanzania (107), Uganda (34), Zambia (25), Zimbabwe (8)

Middle Africa (676): Angola (22), Cameroon (237), Central African Republic (55), Chad (106), Congo (37), Democratic Republic of the Congo (177), Equatorial Guinea (8), Gabon (31), Sao Tome e Principe (3)

Family (some small families with low speaker numbers are not shown)

Africa: Niger-Congo (1524), Afro-Asiatic (366), Nilo-Saharan (199), Khoe-Kwadi (12), Kx'a (4)

Asia: Austronesian (1223), Sino-Tibetan (453), Austro-Asiatic (169), Tai-Kadai (94), Hmong-Mien (38), Dravidian (84), Japonic (12), Turkic (39), North Caucasian (33), Mongolic (13), Tungusic (11), Kartvelian (5), Koreanic (2)

Melanesia / Oceania: Trans-New Guinea (476), Australian (201), Torricelli (57), Sepik (55), Ramu-Lower Sepik (32), Tor-Kwerba (24), West Papuan (23), South-Central Papuan (22), Lakes Plain (19), Border (15), East Geelvink Bay (12), South Bougainville (9), East Bird's Head-Sentani (8), East New Britain (6), Central Solomons (4), North Bougainville (4), Maybrat (2)

Europe: Indo-European (437), Uralic (37)

Center Top, after query “Myanmar”


7102  languages  seen,  90  matched,  in  about  9  seconds.  Showing  event  data  for  1900-2016  only.  Mouse  over  any  column  head  for  details,  click  to  re-sort  (shift+click  for  multiple  sort  columns).  Click 
any  Language  or  Country  to  see  related  events. 


rank (%ile) | pop      | ISO | language           | region  | country | ADM-1         | national | regional | pivot                | family         | LDC | MT? | Glotto              | orth | CRCL      | events | dead    | affected   | $mil
37 (1%)     | 32035300 | MYA | Burmese            | SE Asia | Myanmar | Magway Region | myaT     | myaT     |                      | Sino-Tibetan   |     | G   | l:dwc/g:g s/p:p     | AC   | HL Y1 Y3? | 4      | 306     | 9,331,183  | $177
191 (3%)    | 3295000  | SHN | Shan               | SE Asia | Myanmar | Shan State    | myaT     | shn      | Eth: laoT thaTB      | Tai-Kadai      |     |     | l:dwc/g:gs          | AC   | HL Y1 Y3? | 9      | 292     | 9,248,885  | $122
269 (4%)    | 1800000  | RHG | Rohingya           | SE Asia | Myanmar | Rakhine State | myaT     | rhg      | Glo: benTB           | Indo-European  |     |     |                     |      | HL        | 12     | 609     | 9,696,525  | $686
301 (5%)    | 1480000  | KSW | Karen, S'gaw       | SE Asia | Myanmar | Bago Region   | myaT     | ksw      |                      | Sino-Tibetan   |     |     | l:dwc/g:g s/p:p     | AC   | HL Y1 Y3? | 4      | 138,387 | 2,472,400  | $4,000
371 (6%)    | 1050000  | KJP | Karen, Pwo Eastern | SE Asia | Myanmar | Kayin State   | myaT     | kjp      |                      | Sino-Tibetan   |     |     | l:wc/g:g s/p:p      |      |           | 7      | 138,513 | 11,677,700 | $4,119
391 (6%)    | 1000000  | rki | Rakhine            | SE Asia | Myanmar | Rakhine State | myaT     | rhg      | Glo: myaT            | Sino-Tibetan   |     |     | l:wc/g:s/p:p        |      |           | 12     | 609     | 9,696,525  | $686
408 (6%)    | 940000   | KAC | Jingpho            | SE Asia | Myanmar | Kachin State  | myaT     | kac      |                      | Sino-Tibetan   |     |     | l:wc/g:g s/p:p      | AC   | HL Y1     | 8      | 226     | 9,295,943  | $119
430 (7%)    | 851000   | mnw | Mon                | SE Asia | Myanmar | Kayin State   | myaT     | kjp      | Glo: khmT srbT vieTB | Austro-Asiatic |     |     | l:dwc/g:g s/p:p     |      | HL Y1 Y3? | 7      | 138,513 | 11,677,700 | $4,119
446 (7%)    | 805700   | prk | Wa, Parauk         | SE Asia | Myanmar | Shan State    | myaT     | shn      | Eth: khmT srbT vieTB | Austro-Asiatic |     |     | l:dwc/g:g s/p:p/t:t |      |           | 9      | 292     | 9,248,885  | $122
539 (8%)    | 563960   | ahk | Akha               | SE Asia | Myanmar | Shan State    | myaT     | shn      |                      | Sino-Tibetan   |     |     | l:dwc/g:g s/p:p     | AC   |           | 9      | 292     | 9,248,885  | $122
540 (8%)    | 560740   | blk | Pa'o               | SE Asia | Myanmar | Shan State    | myaT     | shn      |                      | Sino-Tibetan   |     |     | l:w/p:p             |      | HL        | 9      | 292     | 9,248,885  | $122

Above,  the  response  to  a  query  “Myanmar”.  Each  row  shows  a  single  language.  All  columns  are  sortable,  and  support  shift+click 
for  secondary  sort  keys. 

•  7102  seen,  90  matched  The  total  number  of  ISO  639-3  codes  considered  (7,102),  and  found  in  Myanmar  (90). 

•  rank  the  relative  position  of  this  language  among  all  7,100  languages,  sorted  by  speaker  population.  The  (x%)  gives  its 
percentile  ranking. 

•  pop  speaker population, per Ethnologue 18.

•  ISO, language  ISO 639-3 code and formal language name. The ISO that has the largest speaker population is shown in small
caps to indicate that it is a good candidate to be a language of communication, and/or to provide a model for orthography.

This cell is actionable. When clicked, the lower center frame shows details of all events that affected speakers of this language.

•  region, country  the world is divided into conventional regions: with numbers of languages, the top-level regions are Africa
(2,138), Americas (1,065), Europe (286) and Asia-Pacific (3,613). Each region is then subdivided; e.g. Africa into Eastern,
Western, Northern, Middle, and Southern. Each sub-region can then be specified by country.

The country cell is actionable. When clicked, the lower center frame shows details of all events that affected all areas of this
country.

•  ADM-1  the top-level administrative district associated with the language.

•  national, regional  these are ISO codes of the country’s national language(s), and the nominal regional language - the highest-
population language in the current ADM-1. A T indicates availability of machine translation technology, while B and M
indicate that “big” and “medium” amounts of other data (grammars, dictionaries, corpora) exist. For example, in Indonesia,
the national language is Indonesian, but a minority language like Javanese or Sunda may be the language of education in a
given province. A third, local language is often spoken at home.

•  pivot, family  a pivot language is the language that is most likely to be useful as an intermediate translation tool, assuming that
it has resources. This cell lists the current language’s immediate sisters (in roman) or cousins (in italic), per Ethnologue and/or
Glottolog. The same T B M code shows resource availability. The family is the conventional name of the language phylum.

•  LDC, MT?, Glotto, orth, CRCL  all show resource availability. LDC and CRCL indicate data sets and delivery years; HL
means that a HA/DR lexicon is available from CRCL. MT refers to Google, Bing, Yandex, or GoogleDevelopment. The
Glotto codes indicate a “best guess” as to the availability of basic print resources: Lexical, Dictionary, Wordlist,
Comparative, Grammar (Full or Sketch), Phonology, or Text. This helps distinguish between (somewhat) documented and
(mostly) undocumented languages. Note that as a rule, none of these resources are in e-form. Finally, orth indicates that an
e-corpus sample is available via An Crubadan.

•  events, dead, affected, $mil  These cells summarize all events that have affected the current row’s ADM-1 (not the current
row’s language). We assume that affected equals 10*dead if no value is given, but do not attempt to estimate costs. For the
moment, we do not divide the effects of events over multiple ADM-1s, so there will be some overcounting.
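
The two rules in the last item (affected = 10 * dead when no value is given, and no division of an event's effects across its ADM-1s) can be sketched as follows. The record layout and the sample figures are invented for illustration; only the two rules come from the description above.

```python
from collections import defaultdict


def fill_affected(event):
    """Return a copy with 'affected' estimated as 10 * dead when it is missing."""
    filled = dict(event)
    if filled.get("affected") is None and filled.get("dead") is not None:
        filled["affected"] = 10 * filled["dead"]
        filled["affected_estimated"] = True  # keep estimates distinguishable
    return filled


def summarize_by_adm1(events):
    """Total events, dead, and affected per ADM-1 (costs are not estimated)."""
    totals = defaultdict(lambda: {"events": 0, "dead": 0, "affected": 0})
    for event in map(fill_affected, events):
        # Each listed ADM-1 receives the event's full figures, so an event
        # spanning several ADM-1s is overcounted, as noted in the text.
        for adm1 in event["adm1"]:
            row = totals[adm1]
            row["events"] += 1
            row["dead"] += event["dead"] or 0
            row["affected"] += event["affected"] or 0
    return dict(totals)


# Invented sample events.
sample_events = [
    {"adm1": ["Shan State"], "dead": 12, "affected": None},
    {"adm1": ["Shan State", "Kachin State"], "dead": 3, "affected": 500},
]
summary = summarize_by_adm1(sample_events)
print(summary)
```

Flagging estimated values, as `affected_estimated` does here, would make it straightforward to swap in better-parameterized estimates later.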

Center Bottom, after query “Myanmar”


Click any language or country in the table, above, to see associated events. Initial analysis of returned results follows in the two separate tables below.
Event types by number of languages affected and number of events per language. (People per event type per language can't be calculated yet.)

Event type | languages | language [ISO] and events per language
Storm      | 85        | Chinbon Chin [cnb] 4, Mro-Khimi Chin [cmr] 4, Sumtu Chin [csv] 4, Chak [ckh] 4, Rohingya [rhg] 4, Rakhine [rki] 4
Flood      | 82        | Rawang [raw] 6, Rakhine [rki] 6, Mon [mnw] 6, Lashi [lsi] 6, Kachin [kac] 6, Daai Chin [dao] 6
Landslide  | 62        | Thaiphum Chin [cth] 2, Tedim Chin [ctd] 2, Tawr Chin [tcp] 2, Mun Chin [mwq] 2, Rawngtu Chin [weu] 2, Khumi Chin [cnk] 2
Earthquake | 45        | Shwe Palaung [pll] 2, Shan [shn] 2, Intha [int] 2, Zayein Karen [kxk] 2, Yinchia [yin] 2, Danu [dnv] 2

The table below estimates investment language rank and benefit. Four distinct roles are considered: national language (which includes both statutory and de facto languages), regional language,
which is the most widely spoken language in a given ADM-1 area, pivot sister, which is the closest relative with sizeable technical or data resources, and pivot cousin, which can fill in for a missing
sister.

Every  language  may  have  as  many  as  four  roles.  All  figures  shown  for  each  language  reflect  only  the  events  for  which  it  is  counted  as  a  national,  regional,  sister,  or  cousin  language. 

NB:  Scroll  the  table  up  to  the  top  of  the  frame,  then  click  (or  shift+click)  column  heads  to  sort. 


pop        | role       | investment language | ISO | links | family        | LDC | MT? | orth | CRCL      | events | dead    | affected   | $mil
32,035,300 | 1 National | Burmese             | mya | 90    | Sino-Tibetan  |     | G   | AC   | HL Y1 Y3? | 73     | 417,742 | 81,715,825 | $14,084
344,000    | 2 Regional | Tedim Chin          | ctd | 20    | Sino-Tibetan  |     |     | AC   |           | 9      | 247     | 9,296,843  | $119
3,295,000  | 2 Regional | Shan                | shn | 18    | Tai-Kadai     |     |     | AC   | HL Y1 Y3? | 9      | 292     | 9,248,885  | $122
100,100    | 2 Regional | Tase Naga           | nst | 16    | Sino-Tibetan  |     |     |      | HL Y1     | 6      | 332     | 9,040,410  | $121
1,800,000  | 2 Regional | Rohingya            | rhg | 8     | Indo-European |     |     |      | HL        | 12     | 609     | 9,696,525  | $686
940,000    | 2 Regional | Kachin              | kac | 7     | Sino-Tibetan  |     |     | AC   | HL Y1     | 8      | 226     | 9,295,943  | $119
1,050,000  | 2 Regional | Pwo Eastern Karen   | kjp | 6     | Sino-Tibetan  |     |     |      |           | 7      | 138,513 | 11,677,700 | $4,119
150,000    | 2 Regional | Western Kayah       | kyu | 6     | Sino-Tibetan  |     |     | AC   |           | 2      | 138,383 | 2,420,300  | $4,000
32,035,300 | 2 Regional | Burmese             | mya | 3     | Sino-Tibetan  |     | G   | AC   | HL Y1 Y3? | 4      | 306     | 9,331,183  | $177

The  initial  center  bottom  response  to  the  “Myanmar”  query.  There  are  two  tables  (one  fixed,  one  sortable)  above. 

•  fixed  table  this  summarizes  the  number  of  major  events  that  affected  the  search  query  area.  For  each  event  type,  we  also 
summarize  the  number  of  speaker  communities  affected.  Varying  terrain  can  cause  these  to  vary  greatly. 

•  sortable  table  this  table  estimates  investment  language  rank  and  benefit;  i.e.  the  language(s)  for  which  it  would  be  most 
useful  to  have  advanced  resources. 

•  pop  the  language  speaker  population,  per  Ethnologue. 

•  role  worldwide,  communities  tend  to  be  multilingual.  The  most  common  second  languages  tend  to  be  either  the  national 
language  of  education,  or  a  regional  language  of  province-,  state-,  or  island-wide  communication  (which  may  also  be  a 
language  of  education). 

For our purposes, a pivot sister language is the closest etymologically related language that has “substantial” resources,
preferably machine translation. A pivot cousin is a step removed. As a practical matter, the fact that a language is a sister or
cousin does not necessarily mean that it will be close or comprehensible. We have suppressed some (but not all) of the
artifacts that result from relying on standard linguistic subgrouping; this can be improved. Note that a single language (like
Burmese) may fill multiple roles: it is a national and regional language, and is also etymologically close to some, but by no
means all, of the Sino-Tibetan languages spoken in Myanmar.

•  investment language  as mentioned earlier, the national language is most likely to have good technology support, but is not
necessarily the best pivot language for bootstrapping MT and similar tools. In effect, each row provides data that assists
decision-making on whether an investment language should be pre-provisioned, and what it should be.

•  ISO,  family  the  ISO  639-3  code,  and  conventional  language  family  name. 

•  links  the  number  of  languages  for  which  the  current  language  plays  the  stated  role.  For  example,  Thai  is  listed  as  the  sister  of 
four  Tai-Kadai  languages  spoken  in  Myanmar. 

• LDC, MT, ortho, etc. Summary totals of resources and events, as in the center top table.

Center Top (repeated from above)


7102 languages seen, 90 matched, in about 9 seconds. Showing event data for 1900-2016 only. Mouse over any column head for details, click to re-sort (shift+click for multiple sort columns). Click
any Language or Country to see related events.


rank (%ile) | pop | ISO | language | region | country | ADM 1 | national | regional | pivot | family | LDC | MT? | Giotto | orth | CRCL | events | dead | affected | $mil
37 (1%) | 32035300 | MYA | Burmese | SE Asia | Myanmar | Magway Region | mya | mya | | Sino-Tibetan | l:dwc/g:g | s/p:p | AC | HLY1 | Y3? | 4 | 306 | 9,331,183 | $177
191 (3%) | 3295000 | SHN | Shan | SE Asia | Myanmar | Shan State | mya | shn | Eth: lao, tha | Tai-Kadai | l:dwc/g:gs | | AC | HLY1 | Y3? | 9 | 292 | 9,248,885 | $122

Center  Bottom  (following  click  of  country  “Myanmar”  in  center  top) 


121 events found (66 EM-DAT, 55 GLIDE). White: seen and counted for at least one ADM-1 in the (possibly restricted) search above. Green: seen, but no ADM-1 recognized or language counted.
Blue: GLIDE data, not counted (ADM-1 is not specified).


ID | From | To | Country | abb | Location | Type | Subtype | Deaths | Affected | $000
G 2016-000088 | 2016-08-24 | | Myanmar | MMR | A powerful 6.8 magnitude earthquake struck central Myanmar Wednesday, killing at least three people and damaging some 60 pagodas in the famous ancient city of Bagan. | Earthquake | | | |
G 2016-000092 | 2016-08-19 | | Myanmar | MMR | Tropical storm Dianmu formed in the South China Sea on 17 August and passed through Lao PDR around 2 days later, causing additional heavy rain which has been occurring since 11 August. Currently several districts in Luangprabang, Houaphan and Xaingabouli are affected, as indicated below. | Flash Flood | | | |
G 2016-000058 | 2016-06-09 | | Myanmar | MMR | Heavy monsoon rains since the beginning of June have caused flooding in five states and regions of Myanmar. According to the initial reports from the Government Relief and Resettlement Department, at least 26,000 people are affected in Ayeyarwady, Bago and Sagaing regions as well as Chin and Rakhine states. A total of 14 deaths have been reported from the Union-level Relief and Resettlement Department, media sources and the Rakhine State Government. | Flood | | | |
E 2016-0232 | 2016-06-01 | 2016-06-24 | Myanmar | MMR | Sagaing, Bago regions, Rakhine state | Flood | - | 14 | 3000 | -
E 2016-0224 | 2016-05-23 | 2016-05-23 | Myanmar | MMR | Hpakant region | Landslide | Landslide | 42 | 15 | -
G 2016-000052 | 2016-05-20 | | Myanmar | MMR | Tropical Cyclone ROANU continued moving north-east over the western Bay of Bengal, near the eastern coasts of India, retaining its intensity. On 20 May at 0.00 UTC its centre was located approx. 80 km south-east of Srikakulam district (Andhra Pradesh state, India) and it had max. sustained wind speed of 83 km/h. Over the next 48 h, the cyclone is forecast to strengthen as it continues moving north-east. It may reach Chittagong division (Bangladesh) on 21 May with estimated max. sustained winds of 100-130 km/h. Heavy rain, strong winds and storm surge are expected to affect southern Bangladesh and western Myanmar/Burma. A storm surge of 1.5 m is expected on the coastal area of Kutudbia (Cox's Bazar, Bangladesh) on 21 May morning (UTC). | Tropical Cyclone | | | |
E 2016-0189 | 2016-04-29 | 2016-05-03 | | | | | Convective | 18 | 879 4.4 | -2600

At present, clicking a language or country cell drills down to the related events. Above, 121 events were reported for the “Myanmar”
query: 66 from EM-DAT (given linked “E nnn” numbers), and 55 from the GLIDE set (given “G nnn” numbers). The different
background colors are:

• white we were able to properly extract at least one ADM-1 area for this event from the EM-DAT dataset (which provides a
relatively regular listing of locations). The “E” number is actionable - in effect, it pre-populates the “Find a single EM-DAT#”
text entry in the menu, then searches for all languages and ADM-1s associated with that event.

• green we were able to match the country (Myanmar), but not the ADM-1 entity (Hpakant region could not be parsed).

•  blue  GLIDE  data.  These  reports  have  much  more  descriptive  detail,  but  there  is  no  regular  encoding  of  casualties,  costs,  or 
impact  area.  We  intend  to  align  the  GLIDE  and  EM-DAT  datasets. 
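
The color rules above amount to a small classification step. The following is a minimal sketch of that logic, not the tool's actual code: the `Event` record, its field names, and the `row_color` function are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical event record; field names are assumptions."""
    event_id: str                                   # e.g. "E 2016-0224" or "G 2016-000052"
    source: str                                     # "EM-DAT" or "GLIDE"
    adm1: list[str] = field(default_factory=list)   # ADM-1 areas recognized for the event


def row_color(event: Event) -> str:
    """Return the background color described for the event table."""
    if event.source == "GLIDE":
        return "blue"      # GLIDE data: descriptive, not counted
    if event.adm1:
        return "white"     # EM-DAT event with at least one ADM-1 extracted
    return "green"         # EM-DAT event matched to a country only


print(row_color(Event("E 2016-0224", "EM-DAT")))  # Hpakant region was not parsed
```

Under these assumptions, an EM-DAT event whose location string fails to parse falls through to green, which is the case reported for Hpakant region.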

As  noted,  both  EM-DAT  and  GLIDE  numbers  are  used  in  other  disaster-reporting  contexts.  We  intend  to: 

•  provide  access  to  the  raw  EM-DAT  and  GLIDE  data,  and 

• attempt to locate and link to any external data or sites related to the individual events.

Maps (far right)

[Screenshots of four Google Maps heat maps (map and satellite views) covering Myanmar and neighboring regions:
Map 3: each ADM-1, counted and weighted by sqrt(number of events)
Map 4: each event, weighted by ln(average number affected), split across each affected ADM-1
Map 5: ln(population) for 50,000 cities with population over 5,000
Map 6: GLIDE events, weighted by ln(events per point). Some points may be default country locations.]



We  generate  six  heat  maps  based  on  the  query  for  demonstration  purposes  (they  render  all  7,100  points  very  quickly).  None  of  these 
maps  are  actionable,  but  that  is  an  obvious  next  step.  They  are: 

•  Map  1  language  density  in  the  query  area  (in  this  case,  Myanmar). 

•  Map  2  each  language  is  weighted  by  the  number  of  events  it  is  involved  in. 

•  Map  3  each  of  Myanmar’s  15  ADM-1  regions  is  treated  as  a  centroid  point,  and  weighted  by  number  of  events. 

• Map 4 each event is weighted by the log of the average number of people affected, split across affected ADM-1s.

•  Map  5  relative  populations  for  all  cities  >  5,000. 

•  Map  6  GLIDE  events  (which  are  given  lat/long  points),  weighted  by  number  of  events/point  (GLIDE  sometimes  uses  a  single 
point  as  the  nominal  location  of  many  events). 
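
The weightings named for these maps (a square root of event counts, and logarithms of affected counts or events per point) can be sketched as below. This is an illustrative reading only: the function names are hypothetical, and the even split across ADM-1 areas is an assumption about how "split across" is applied.

```python
import math


def adm1_weight(n_events: int) -> float:
    """Map 3 style: weight an ADM-1 centroid by sqrt(number of events)."""
    return math.sqrt(n_events)


def event_weight_per_adm1(avg_affected: float, n_adm1: int) -> float:
    """Map 4 style: ln(average number affected), split evenly across ADM-1 areas.

    The even split is an assumption; zero or missing values get weight 0.
    """
    if avg_affected <= 0 or n_adm1 <= 0:
        return 0.0
    return math.log(avg_affected) / n_adm1


def glide_point_weight(n_events_at_point: int) -> float:
    """Map 6 style: ln(events per point); a single event therefore weighs 0."""
    return math.log(n_events_at_point) if n_events_at_point > 0 else 0.0


print(adm1_weight(9), event_weight_per_adm1(3000, 3), glide_point_weight(12))
```

Square-root and log weights both damp the influence of a few very large values, which keeps a single catastrophic event or a default country point from saturating the heat map.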
Appendix F: Tool snapshots

(taken from the project’s Y1Q3 report)


Tool snapshots

CRCL is willing to provide access to many of our internal tools to other LORELEI performers.
There are four web-based platforms:

~project/lorelei/data tools that focus on exploration of source texts. They provide highly
detailed overviews and analyses of all data within one or more lects found within a given text.

~project/lorelei/dict tools that allow more traditional dictionary queries based on semantic and
phonological criteria. Sources may be restricted by author, language, phylogenetic subgroup,
or geographical region or proximity.

~project/lorelei/cogs the tools we use for exploring and creating cognate sets. They incorporate
functionality for semantic fallback also seen on the /dict page.

~project/lorelei/down the project download page. At this point we only link to prepared sets.
However (given the complexity of the other pages) we will probably build in hooks to allow
preparation and download of customized sets.

Please  note  that  these  pages  are  built  by  and  for  the  CRCL  development  team.  They  are: 

•  beyond  the  scope  of  defined  project  deliverables,  and  not  documented  in  detail, 

•  usually  built  to  assist  our  own  internal  data  audit  and  evaluation, 

•  subject  to  change  at  any  time,  and  not  guaranteed  to  be  stable  or  persistent. 

We  are  exposing  them  in  order  to: 

•  reveal  the  full  extent  of  our  datasets,  including  implicit  as  well  as  explicit  content, 

•  clarify  our  capacities  for  data  analysis  and  extraction, 

•  encourage  requests  for  non-traditional  data  applications. 

Essentially all functionality is provided by REST calls, and could be made accessible via
external http queries (i.e. for machine-handling of returned data). Indeed, it must be understood
that the purpose of many of these tools is simply to instantiate and help visualize (for testing
purposes) the results of information extraction functionality that, in the long run, will be used in
machine-to-machine communications.
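
As an illustration of what such an external http query might look like, the sketch below only builds a GET URL. The host, the parameter names (`gloss`, `iso`, `format`), and the helper `build_query_url` are all hypothetical; the report specifies only the page paths, not a public API.

```python
from urllib.parse import urlencode

# Hypothetical base URL; the report gives only the path "~project/lorelei/dict".
BASE_URL = "https://example.org/project/lorelei/dict"


def build_query_url(base: str, **params: str) -> str:
    """Return a GET URL with the parameters encoded in a stable (sorted) order."""
    return f"{base}?{urlencode(sorted(params.items()))}"


# Hypothetical parameter names: a gloss, an ISO 639-3 code, and an output format.
url = build_query_url(BASE_URL, gloss="butterfly", iso="rtc", format="tsv")
print(url)
```

A client would then fetch this URL and parse the returned table, which is the machine-to-machine use the paragraph above anticipates.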
/data overview

Normalizing transcribed data seems simple, but given many sources and ill-defined transcription
systems (sometimes co-occurring in a single text), producing results that are consistent and
accurate is extremely difficult. This page provides our main overviews. We begin with a quick
overview of the menu.

Sketch and inspect names the relevant source texts (bibrefs) and lects (logical columns).

Sketches  provides  various  content  inventories 
and  counts,  including  phonemes, 
onset/nucleus/coda  segments,  canonical  syllable 
shapes,  and  the  like. 

Form/gloss presents tables, usually in a compact form, of gloss and/or phonological form content.
Like the next few functions, it is intended to provide a quick overview of the content of a
typical 500-2,500 item lexicon, and is mainly used to oversee the automated processes that
control semantic and phonological normalization.

MetaGloss summary tabulates and counts all normalized gloss forms by our extended
part-of-speech definitions.

Syllable  table  tabulates  individual  lect  content 
by  syllables,  allowing  sorts  by  onset,  nucleus, 
coda,  and  count. 

Segment  table  provides  a  global  view  of  various 
syllable  constituents  for  all  datasets  in  a  source. 
The  different  view  and  sort  options  help  spotlight 
each  of  the  underlying  conversion  decision 
processes. 

Seg  summary  extracts  and  analyzes  all  syllable 
components  from  the  complete  dataset.  It  reveals 
the  low-frequency  elements  that  are  more  likely 
to  be  errors,  and  provides  a  basic  sanity  check  on 
the  dataset  as  a  whole. 
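
A sanity check of this kind can be sketched by counting every onset, nucleus, and coda and listing the rarest ones first. The sketch below is an assumption about the general approach, not the tool's code; `rare_segments`, its `max_count` threshold, and the toy syllables are hypothetical.

```python
from collections import Counter


def rare_segments(syllables, max_count=1):
    """List (slot, segment, count) for segments seen at most max_count times.

    syllables: iterable of (onset, nucleus, coda) tuples; "" marks an empty slot.
    """
    counts = Counter()
    for onset, nucleus, coda in syllables:
        counts[("onset", onset)] += 1
        counts[("nucleus", nucleus)] += 1
        counts[("coda", coda)] += 1
    rare = [(slot, seg, n) for (slot, seg), n in counts.items() if n <= max_count]
    # Rarest first, then by slot and segment for a stable listing.
    return sorted(rare, key=lambda item: (item[2], item[0], item[1]))


toy = [("p", "a", "k"), ("p", "a", ""), ("k", "a", "k"), ("pr", "u", "")]
print(rare_segments(toy))
```

Segments that occur only once in a whole dataset are the ones most worth checking against the source, since they are as likely to be transcription or conversion errors as real phonemes.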

Cover & contrast tables answer two questions: what is the (probably) smallest subset of words
that demonstrates all of a language’s phonological features, and what is the complete set of
words that demonstrates all positional contrasts (hat vs cat contrast onset h/c).
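
Finding a smallest covering subset is a set-cover problem, for which a greedy pass gives a good but not guaranteed-minimal answer (consistent with the "probably" above). The sketch below uses the standard greedy heuristic as a stand-in; it is not necessarily the method the tool uses, and the word-to-feature mapping is a toy assumption.

```python
def greedy_cover(word_features: dict[str, set[str]]) -> list[str]:
    """Pick words until every feature seen in the data is demonstrated.

    Each step takes the word covering the most still-uncovered features
    (ties broken alphabetically).
    """
    remaining = set().union(*word_features.values())
    chosen = []
    while remaining:
        best = max(sorted(word_features),
                   key=lambda w: len(word_features[w] & remaining))
        chosen.append(best)
        remaining -= word_features[best]
    return chosen


# Toy data: each word mapped to the segments it demonstrates.
words = {
    "hat": {"h", "a", "t"},
    "cat": {"k", "a", "t"},
    "at": {"a", "t"},
    "hack": {"h", "a", "k"},
}
print(greedy_cover(words))
```

Here two words suffice to show all four segments; the contrast table is the complementary view, keeping every word pair that differs in exactly one position.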

Assemble for download packages the contents of these sources for inspection or download.

Semantics applies various measures of sentiment to lexicon semantics, and/or reveals
co-lexification (use of the same word for different semantic concepts).
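
Co-lexification can be detected by grouping entries on (lect, form) and keeping the forms that carry more than one gloss. This is a minimal sketch under that assumption; the function name, the `min_glosses` parameter, and the sample entries are hypothetical.

```python
from collections import defaultdict


def colexifications(entries, min_glosses=2):
    """Map (lect, form) to its sorted glosses when it carries several concepts.

    entries: iterable of (lect, form, gloss) triples.
    """
    by_form = defaultdict(set)
    for lect, form, gloss in entries:
        by_form[(lect, form)].add(gloss)
    return {key: sorted(glosses)
            for key, glosses in by_form.items()
            if len(glosses) >= min_glosses}


# Invented sample entries, for illustration only.
sample = [
    ("lectA", "ma", "mother#k#1"),
    ("lectA", "ma", "horse#n#1"),
    ("lectA", "pa", "father#k#1"),
    ("lectB", "ma", "mother#k#1"),
]
print(colexifications(sample))
```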

Coverage  overview  provides  summary  and  detailed  tables  of  linguistic  coverage  and  content, 
excluding  lects  with  fewer  than  some  minimum  number  of  items. 
[Screenshot: the /data main menu. Its panels are Sketch and inspect (bibref and column entry, both required), Sketches,
Form/gloss (detailed), MetaGloss summary, Syllable table, Segment table, Seg summary, Cover & contrast tables,
Assemble for download (tsv, xml, or htm output), Semantics (sentiment or colexification), and Coverage overview.
Preset sources listed include theraphan2001languages_1 (x14), theraphan2001languages_2 (x7),
huffman1971vocabulary (x18), and huffman1979vocabulary (x11).]

/data examples

Sketch As noted, our initial interest in this view is simply to get a bird’s-eye view of the results
of phonological conversion. There is a built-in mechanism (add phono notes) that displays any
available data from PHOIBLE or the World Phonotactic Database. The Shapes are sorted first
by length + alphabet, and then by frequency. The DiSylCon and DiSylVow entries show
word-internal syllable boundary conditions for consonants and vowels. Many elements are
actionable - a double-click on one of the shapes will find all forms with that shape.
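
The two Shapes orderings can be illustrated with a small sketch that maps each syllable to a C/V shape and sorts the resulting inventory both ways. The vowel set and the digraph list (treating an aspirate such as "ph" as one consonant) are simplified ASCII assumptions, not the project's actual segment inventories.

```python
from collections import Counter

VOWELS = set("aeiou")            # assumption: toy ASCII vowel set
DIGRAPHS = ("ph", "th", "kh")    # assumption: aspirates count as one consonant


def shape(syllable: str) -> str:
    """Return the canonical shape of a syllable, e.g. "prua" -> "CCVV"."""
    out = []
    i = 0
    while i < len(syllable):
        if syllable[i:i + 2] in DIGRAPHS:
            out.append("C")
            i += 2
        elif syllable[i] in VOWELS:
            out.append("V")
            i += 1
        else:
            out.append("C")
            i += 1
    return "".join(out)


def shape_table(syllables):
    """Return shapes sorted by length + alphabet, by frequency, and the counts."""
    counts = Counter(shape(s) for s in syllables)
    by_length = sorted(counts, key=lambda s: (len(s), s))
    by_freq = sorted(counts, key=lambda s: (-counts[s], s))
    return by_length, by_freq, counts


by_length, by_freq, counts = shape_table(["pa", "phu", "kat", "prua", "a", "pa"])
print(by_length, by_freq)
```

Looking up every form with a given shape is then a filter on `shape(s) == wanted`, which is the behavior the double-click triggers.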


As  with  other  items,  all  suggestions  regarding  additions,  refinements,  and  more  convenient 
means  of  providing  access  to  these  data  are  welcome. 


[Screenshot: phonological sketches for two lects from lsm2015chin. Lect 1, [rtc] Rungtu Chin: 447 syllables (390 distinct),
262 syllable boundaries added; 11 vowels, 19 consonants, 1 diacritic. Its most frequent shapes are CV (313), CVC (229),
CCV (52), CCVC (42), V (42), and VC (11). Lect 2, [cnb] Chinbon Chin: 461 syllables (455 distinct), 258 syllable boundaries
added; 10 vowels. Each sketch lists Inventory, Shapes (in both sort orders), Marked, Onset, Nucleus, Coda, Syllabic?,
DiSylCon, and DiSylVow entries with counts.]


The Shapes are actionable, and trigger a source lookup. Below, all CCVV syllables; note that by
design, the aspirated /ph/ is detected as a single character while the palatalized /pj/ forms are not:

[Screenshot: lookup results. The shape is expanded by replacing C and V with alternations over the consonant and vowel
inventories. 8 items found, 8 items returned in 2 seconds (note limit of 50 items per doculect). Columns: copper, silver, gloss,
ISO, language, family, bibref. All eight entries are [rtc] Rungtu Chin (ST) from lsm2015chin 1, glossed butterfly, thigh, man,
woman, to give, to pay, nine (persons), and half (quantity).]

Forms A view of raw copper and machine-processed silver forms from Tryon’s Austronesian
data. The silver columns show normalization of the transcription, and syllabification of
individual forms. These views let us review large amounts of data quickly, identifying whether
irregularities are due to our processing, or were found in the raw data.

[Screenshot: form table. Max 1280 forms found. The row numbers are arbitrary and rows in different columns are not related.
Each column is sorted by alpha. You can adjust the content using the main menu's form/gloss table. Five column/lect pairs,
each with copper and silver forms: [tay] Atayal, [tsu] Tsou, [dru] Rukai, [pwn] Paiwan, [tao] Yami.]


Glosses These may be extracted and viewed standalone in order to make it easier for LORELEI
collaborators to understand the content and ordering of the various comparative and survey
elicitation lists. Below, part of the LSM2015 list. Note that while most entries are given as
WordNet form#POS#sense-number, we rely on extended forms for many kin terms (“Obro.fem”
= “older brother of female”), and other terms that are widely lexicalized in Asia-Pacific. A small
amount of inconsistency and uncertainty is expected for these silver glosses.
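
A parser for the form#POS#sense-number notation might look like the sketch below. Three readings are assumptions inferred from entries in the list rather than documented rules: a trailing "?" marks an uncertain assignment, "|" separates alternative senses, and a ":" suffix is a modifier (as in field#n#1:wet). The `Gloss` type and `parse_gloss` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Gloss(NamedTuple):
    form: str
    pos: str
    sense: int
    modifier: Optional[str]
    uncertain: bool


def parse_gloss(text: str) -> list[Gloss]:
    """Parse one entry, which may hold several '|'-separated alternatives."""
    result = []
    for alt in text.split("|"):
        alt = alt.strip()
        uncertain = alt.endswith("?")       # assumed meaning of a trailing "?"
        if uncertain:
            alt = alt[:-1]
        body, _, modifier = alt.partition(":")
        form, pos, sense = body.split("#")
        result.append(Gloss(form.strip(), pos.strip(), int(sense),
                            modifier or None, uncertain))
    return result


print(parse_gloss("Obro.fem#k#1"))
print(parse_gloss("fire#n#3|fire#n#7"))
```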


Total 441 distinct forms found (longest list 446). Compact sort (in 6 columns). Showing gloss forms only. Multiple lists will be
merged in a single table.

I#p#1, Obro.fem#k#1, Obro.mal#k#1, Osis.fem#k#1, Ybro.fem#k#1, Ybro.mal#k#1, Ysis.fem#k#1, Ysis.mal#k#1,
afraid#a#1, all#a#1, angry#a#1, ant#n#1, arm#n#1, armpit#n#1, arrow#n#2, ascend#v#1, ashamed#a#1, ashes#n#1,
back#n#1, bad#a#1, bald#a#2, bamboo#n#2, bamboo.shoot#n#1, banana#n#2, bark#n#1, bark#v#4, barking deer#n#1,
bathe#v#3, bear#n#1, beard#n#1, bee#n#1, beer#n#1, belly#n#1, bend#t#3?, betel nut#n#1, big#a#1, bird#n#1,
birdnest#n#1, bite#v#1, bitter#a#6, black#a#1, blanket#n#1, blind#a#1, blood#n#1, blow#v#1, blunt#a#1|blunt#a#2?,
body_hair#n#1, boil#t#2, bone#n#1, bow#n#4, brain#n#1, branch#n#2, breathe#v#1, buffalo#n#4, burn#t#1, bury#v#2,
butterfly#n#1, buy#v#1, calf#n#2, cane#n#2, cat#n#1, cheek#n#1, chicken#n#2, chin#n#1, choose#v#1, clothing#n#1,
cloud#n#2, cockroach#n#1, cold#a#1:feeling, comb#n#1, come#v#1, cook#t#2, cooked rice#n#0, cool#a#1:object,
corn#n#1, correct#a#1, cough#v#1, count#v#1, cow#n#1, crawl#v#1, crest#n#5, crocodile#n#1, crossbow#n#1, cut#v#1,
dance#v#1, dark#a#1, day#n#4, deaf#a#1, deep#a#3, descend#v#1, die#v#1, difficult#a#1, dig#v#1, dirty#a#1,
do not#x#1, dog#n#1, door#n#1, dream#v#2, drink#v#1, drum#n#1, drunk#a#1, dry#a#1, dry#t#1, dust#n#1, ear#n#1,
earth worm#n#1, east#n#4, easy#a#1, eat#v#1, egg#n#2, eggplant#n#1, eight#n#1, elbow#n#1, elephant#n#1,
elephant_tusk#n#0, enter#v#1, exchange#v#2, excrement#n#1, extinguish#v#2, eye#n#1, eyebrow#n#1, eyelid#n#1,
face#n#1, fall#i#1, far#a#1, fast#a#1, fat#a#1, father#n#1, feather#n#1, few#a#1, field#n#1:dry, field#n#1:wet,
fight#v#1, fingernail#n#1, fire#n#3|fire#n#7, firewood#n#1, fish#n#1, five#n#1, float#i#1, flow#v#2, flower#n#2,
fly#n#1, forehead#n#1, forget#v#2, four#n#1, free#v#1, friend#n#1, frog#n#1, fruit#v#2, full#a#1, garlic#n#1,
ghost#n#3, ginger#n#3, give#v#1, go#v#3, go.out#v#1, gold#n#3, gong#n#1, good#a#1, grassland#n#1, green#a#1,
grind#v#5, gums#n#2, half#a#1, hard#a#3, hate#v#1, he&she#p#1, head#n#1, head hair#n#1, head louse#n#1, hear#v#1,
heart#n#2, heavy#a#1, heel#n#2, hide#v#2, hit#v#3, horn#n#2:buffalo, hot#a#1:feeling, hot#a#1:object

Metagloss summary This summarizes the entire dataset. Below, m marks modals, x are
unassigned, j are conjunctions, r are adverbs, etc. Classes may be assigned algorithmically, e.g.
we can distinguish open and closed-class adverbs. The items are WordNet 3.0 senses, with
extensions as needed. New classes (e.g. pronouns or kin terms) are numbered beginning with 1,
while the standard n, v, a, r classes are numbered 0. Note that because of wide variation in raw
glosses, and corresponding difficulty in disambiguating senses, in some cases precise assignments
will not be completely resolved until we are further along in cognate grouping.
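
The per-class "forms / items" tallies can be sketched as follows, assuming each item is a normalized gloss string in form#POS#sense notation. The sketch drops ":" modifiers, folds i and t into v, and applies a minimum cutoff to each distinct gloss; the function name and the exact placement of the cutoff are assumptions.

```python
from collections import Counter, defaultdict


def metagloss_summary(glosses, min_items=3):
    """Return {pos: (distinct glosses, total items)} for glosses meeting the cutoff."""
    counts = Counter()
    for gloss in glosses:
        body = gloss.split(":", 1)[0]          # ignore ":" modifier elements
        form, pos, sense = body.split("#")
        if pos in ("i", "t"):                  # treat (in)transitive items as verbs
            pos = "v"
        counts[(form, pos, sense)] += 1
    summary = defaultdict(lambda: [0, 0])
    for (form, pos, sense), n in counts.items():
        if n >= min_items:
            summary[pos][0] += 1               # one more distinct gloss in this class
            summary[pos][1] += n               # its items
    return {pos: tuple(pair) for pos, pair in summary.items()}


items = (["must#m#1"] * 3 + ["boil#t#2"] * 2 + ["boil#v#2"]
         + ["cold#a#1:feeling"] * 2 + ["cold#a#1:object"] + ["rare#a#1"])
print(metagloss_summary(items))
```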


3909 distinct glosses from 441061 items. Minimum cutoff is 3.

This view ignores : modifier elements, and treats i(ntransitive) / t(ransitive) items as v(erbs).

Each class is listed as forms / items, followed by its glosses with item counts.

m: 2 / 41. must#m#1 (33), should#m#1 (8)
x: 5 / 456. do not#x#1 (269), to#x#0 (132), per#x#1 (38), of#x#1 (10), passive#x#1 (7)
j: 7 / 713. and#j#1 (286), if#j#1 (129), because#j#1 (116), or#j#1 (105), until#j#1 (38), but#j#1 (37)
d: 5 / 1259. that#d#1 (609), this#d#1 (480), these#d#1 (84), those#d#1 (47), this_one#d#1 (39)
q: 12 / 3489. when#q#1 (667), where#q#1 (505), who#q#1 (503), what#q#1 (498), how_many#q#1 (435), how#q#1 (241),
how much#q#1 (203), why#q#1 (185), which#q#1 (174), how few#q#1 (43), from_where#q#1 (34)
p: 13 / 5101. you#p#1 (1114), we#p#1 (969), he#p#1 (601), I#p#1 (587), they#p#1 (577), she#p#1 (392), it#p#1 (250),
some#p#1 (213), others#p#1 (147), my#p#1 (86), his#p#1 (77), oneself#p#1 (51), your#p#1 (37)
r: 75 / 5616. before#r#1 (341), no#r#3 (267), on#r#0 (226), thus#r#2 (222), below#r#1 (186), in front#r#1 (182),
after#r#1 (179), again#r#1 (175), above#r#2 (168), behind#r#1 (163), not#r#1 (145), up#r#1 (141), down#r#1 (140),
always#r#2 (138), from#r#0 (122), never#r#1 (117), ...
k: 147 / 13094. fat#k#1 (582), son#k#1 (582), hus#k#1 (577), mot#k#1 (541), wif#k#1 (529), Ison#k#1 (493),
Osis.fem#k#1 (322), dau#k#1 (303), Idau#k#1 (279), Ybro#k#1 (241), fat.par#k#1 (236), mot.par#k#1 (227),
chi.chi#k#1 (218), chi#k#1 (214), Ysis.fem#k#1 (200), Ysis.mal#k#1 (200), ...
a: 484 / 64197. hot#a#1 (766), true#a#1 (752), quick#a#1 (690), near#a#1 (685), old#a#1 (618), full#a#1 (617),
fat#a#1 (605), cold#a#1 (602), dry#a#1 (569), narrow#a#1 (568), far#a#1 (563), hungry#a#1 (552), wrong#a#1 (552),
bitter#a#6 (550), good#a#1 (542), wide#a#1 (537), ...
v: 1177 / 131383. cut#v#1 (771), fall#v#1 (693), bite#v#1 (639), know#v#1 (635), weep#v#1 (622), smell#v#1 (612),
tie#v#1 (604), lift#v#1 (595), return#v#1 (588), dig#v#1 (583), steal#v#1 (583), give#v#1 (580), fly#v#1 (577),
love#v#3 (576), throw#v#1 (571), bring#v#1 (570), ...
n: 1982 / 228641. rice#n#1 (1334), chicken#n#2 (854), person#n#1 (769), sarong#n#1 (695), year#n#1 (692),
paddy#n#3 (689), tree#n#1 (688), knife#n#1 (676), louse#n#1 (667), leech#n#1 (649), meat#n#1 (639), stone#n#1 (637),
house#n#1 (627), night#n#1 (621), lightning#n#1 (617), hair#n#1 (610), ...


Syllable table Segments (onset, nucleus, coda) can be viewed in the context of single languages
(as below), or as large comparative tables. Below, vowel segments - and the syllables they
appear in - from two Kra-Dai languages (from Hudak 2008):

[Screenshot: two syllable tables, one per lect. Each row gives a vowel nucleus (short and long vowels listed separately)
followed by the tone-marked syllables that contain it, with counts.]

Segment table We sometimes need to look at syllable components in order to understand their
distribution (from a linguistic point of view), or as a more practical matter, to help explain
apparent gaps in the source notation - differences between two lects may be real, or they might
just be a consequence of the field worker’s notation. Below, a sample from a complete set of
onsets for all 23 Hmong-Mien languages in Wang 1995. These have been ordered
longest-onset-first; other options include alphabetical order and frequency. The colored cells
account for more than 5% of a given language’s total:


[Screenshot: onset table. Rows are the 23 lects wang1995miao 1 through wang1995miao 23; columns are onsets, ordered longest first, with occurrence counts in parentheses. The cell contents did not survive text extraction.]

These cells are also actionable: double-clicking one opens a lookup of the forms behind it. Below, the /ntsh/ onsets.


[Screenshot: the onset table restricted to /ntsh/-type onsets for wang1995miao 1 through wang1995miao 18, with a browser window (lookup.pl) showing the forms behind one cell for wang1995miao 8. The text of the lookup window follows; only three of its nine rows survived text extraction intact.]

9 items found, 9 items returned in 1 seconds (note limit of 50 items per doculect)
Limit per doculect is automatically raised to 50 for double-click.
The table below can be edited in place, and copied and pasted into a spreadsheet.
Double-click on any row to delete it completely. For more features (e.g. gloss, ISO) unclick the "Editable view ... only" box on the right, then click search.

copper      silver     bibref
n?tsha55    ntsha55    wang1995miao 8
n?tsha31    ntsha31    wang1995miao 8
n?tshe24    ntshe24    wang1995miao 8

Seg summary. Below is an overview of all onset, nucleus, coda, and tone sequences, sorted by frequency with the "50 min" option selected, so that only items occurring at least 50 times are shown. Note that Zipf's Law holds: frequent items dominate. Ignoring very low frequency items deals with noise (which can usually be traced to errors in the original data), while having minimal impact on the size or representativeness of the full database.

These sequences are drawn from 863528 syllables from 444143 words in 558 doculects.

The presence of leading raised consonants reflects an attempt to stay true to raw sources, while providing a hint as to how they should be analyzed. In general, leading raised consonants could probably be lowered and treated as syllabic segments (in some cases with an unwritten epenthetic schwa, depending on the exact sequences). Some semi-analyzed data is included here, so some irregularity (such as apparent tone marks in vowel sequences) may be encountered.

Below, all columns are sorted using the specified reverse option.


Onset: 1734 distinct items, 805655 raw total, 294 distinct >= 50, 794050 total, 98.5% of distinct items shown
Nucleus: 720 distinct items, 860330 raw total, 190 distinct >= 50, 856137 total, 99.5% of distinct items shown
Coda: 330 distinct items, 328438 raw total, 54 distinct >= 50, 326843 total, 99.5% of distinct items shown
On-coda: 1963 distinct items, 1134093 raw total, 309 distinct >= 50, 1121494 total, 98.8% of distinct items shown
Tone: 95 distinct items, 253143 raw total, 52 distinct >= 50, 252810 total, 99.8% of distinct items shown

rank  Onset  items  Nucleus  items   Coda  items  On + coda  items  Tone  items
1     k      65630  a        209363  0     61298  n          94386  55    51435
2     t      65308  i        104752  n     48796  k          93841  33    35346
3     l      60476  u        94240   j     38381  t          85945  31    27134
4     m      55043  o        74691   ?     30258  m          82535  53    16543
5     p      46695  a        62108   k     28211  0          77297  4     14900
6     n      45590  e        57233   m     27492  l          67009  35    13350
7     s      38401  a        52444   w     21379  j          58493  2     11333
8     r      32283  0        24949   t     20637  p          57747  21    11134
9     b      28895  a:       23500   p     11052  ?          57347  13    9254
10    ?      27089  c        18866   r     8803   S          42044  11    7204
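The threshold summary described above (distinct items, raw total, and the items that occur at least 50 times) can be sketched in a few lines. This is a minimal illustration, not the project's code: the function name `summarize`, its return keys, and the toy onset list are all assumptions made for the example.

```python
from collections import Counter


def summarize(items, min_count=50):
    """Rank segments by frequency and report how much a count threshold keeps."""
    counts = Counter(items)
    # most_common() orders by descending count, which is the ranking shown in the table.
    kept = [(seg, n) for seg, n in counts.most_common() if n >= min_count]
    raw_total = sum(counts.values())
    kept_total = sum(n for _, n in kept)
    return {
        "distinct": len(counts),
        "raw_total": raw_total,
        "distinct_kept": len(kept),
        "kept_total": kept_total,
        "share_kept": kept_total / raw_total if raw_total else 0.0,
        "ranked": kept,
    }


# Toy data (hypothetical): one rare, suspicious onset among frequent ones.
onsets = ["k"] * 6 + ["t"] * 3 + ["kkhx"]
summary = summarize(onsets, min_count=2)
print(summary["ranked"], summary["share_kept"])
```

Because frequent items dominate, a threshold of this kind removes many distinct items while leaving the share of tokens kept close to 1.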

For our own data audit purposes, sorting by the number of Unicode characters in the sequence is more useful, since longer sequences are more indicative of error, e.g. tone "2323" in the second row. The second onset, /ntgh/, is a subtler error: when /n/ was raised to indicate prenasalization, the /tg/ affricate was not properly recognized. The onset at rank 10 has an equivalent problem. This occurs because the IPA and Unicode do not treat all affricates in the same way. We fixed the whole class of errors with a minor code tweak that 'unifies' some two-character affricates that do not have pre-built digraphs.
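The two ideas in this paragraph, auditing by sequence length and treating a two-character affricate as one segment, can be sketched as follows. This is an illustration under stated assumptions rather than the tweak the project actually made: the affricate inventory `TWO_CHAR_AFFRICATES` and both function names are hypothetical.

```python
# Hypothetical inventory of affricates written as two code points.
TWO_CHAR_AFFRICATES = frozenset({"tʂ", "dʐ", "tɕ", "dʑ"})


def split_segments(sequence, affricates=TWO_CHAR_AFFRICATES):
    """Split a string into segments, keeping listed affricates together."""
    segments = []
    i = 0
    while i < len(sequence):
        pair = sequence[i:i + 2]
        if pair in affricates:
            segments.append(pair)  # unify the two code points into one segment
            i += 2
        else:
            segments.append(sequence[i])
            i += 1
    return segments


def audit_order(sequences):
    """Longest sequences first (by code point count), then alphabetical."""
    return sorted(sequences, key=lambda s: (-len(s), s))


print(split_segments("ntʂh"))
print(audit_order(["55", "2323", "31"]))
```

With the affricate kept whole, a raised prenasal can be separated from it cleanly; without the inventory the same onset would be split into four unrelated characters.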


rank  Onset  items  Nucleus  items  Coda  items  On + coda  items  Tone   items
1     kkhx   116    31       314    J     544    “kht       116    11-21  173
2     mgh    76     ea       308    j     440    msh        76     2323   419
3     tgh    1157   d:       251    n     230    t?h        1157   231    3174
4     khr    647    u:       250    …     143    kh-        647    214    915
5     Phj    588    ia       198    m     111    P*         588    213    859
6     kh.    559    e:       182    gs    110    khj        559    132    610
7     phr    330    ua       167    j?    108    phr        330    454    602
8     khw    311    pa       156    l?    80     kh"        311    343    446
9     mbw    297    i:       153    mb    77     mb”        297    314    380
10    …      288    o:       150    0?    68     n\         288    112    325
11    m^h    269    a:       142    e?    66     m^h        269    323    240
12    khl    242    ua       130    0?    63     khl        242    312    239
13    kkh    230    o:y      126    P     62     kkh        230    342    204
14    Phl    201    a:       118    *9    59     Phl        201    232    87
15    n^h    188    c:       116    t°    58     Dfh        188    545    84
16    “1“    175    ua       103    P"    50     kfh        175    212    50
17    Ukl>   174    u:       96     0     61298  ok1*       174    55     51435
18    "tSh   173    ia       95     n     48796  “ts6       173    33     35346
19    ntr    173    ua       95     j     38381  °tr        173    31     27134
20    mkh    145    …        91     ?     30258  mkh        145    53     16543

Cover & contrast. These tables help describe each language's internal variation, and also provide minimal datasets that are extremely useful for testing downstream applications. Below, we see the distinct onset, nucleus, and coda segments found in the complete dataset, followed by a list of 34 words that use them all in context. Because the list is produced by a computationally feasible greedy set cover algorithm, it is likely to be at or near the smallest such set, but this is not guaranteed.
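The greedy set cover step can be sketched as below. This is a generic textbook version, assuming each word has already been reduced to the set of (slot, segment) pairs it contains; the function name `greedy_cover` and the four toy words are made up for illustration and are not taken from the dataset.

```python
def greedy_cover(words):
    """words: dict mapping a word to the set of (slot, segment) pairs it contains.

    Repeatedly pick the word that covers the most still-uncovered pairs.
    """
    uncovered = set().union(*words.values())
    chosen = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(words), key=lambda w: len(words[w] & uncovered))
        chosen.append(best)
        uncovered -= words[best]
    return chosen


# Toy lexicon (hypothetical forms).
words = {
    "kap": {("onset", "k"), ("nucleus", "a"), ("coda", "p")},
    "kan": {("onset", "k"), ("nucleus", "a"), ("coda", "n")},
    "pan": {("onset", "p"), ("nucleus", "a"), ("coda", "n")},
    "pi": {("onset", "p"), ("nucleus", "i")},
}
cover = greedy_cover(words)
print(cover)
```

Tagging each segment with its slot keeps an onset /p/ distinct from a coda /p/, so the cover has to exhibit both. Greedy selection gives a cover whose size is within a logarithmic factor of the minimum in the worst case, which is why the result is described as near-minimal rather than minimal.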


111 words in list, 34 words required to cover huffman1971vocabulary 3
Tokens: onset x36, nucleus x33, coda x12

Onset: b c ch d f h j j k kl kr kh kh" kw 1 m m n n p pi pr ph ph-’ p* r r s t t*1 w rj 1 ji ? ?n
Nucleus: a ae ao ao ae a at e ea e ea i i o oa oe o oa u ua u a ae at a: a d d a ae a £ c
Coda: c h j k l m n p t rj ji ?

form                  silver gloss
sarj|?oe              anus#n#1
ta|cap                arrive#v#1
ha dua                at#r#0
ba:rj|kon             birth#t#1
ha|ret|mot            blink#v#1
chim                  blood#n#1
k"i                   cait#n#1
kb jat pajla?         clothing#n#1
lac                   copper#n#1
kap|tho?              cover#v#1
krac                  deer#n#1
?a|ca|ha|?uj          diviner#n#1
ka[pai)               dry_season#n#1
fah                   fever#n#1
poa|waji              game#n#1
ha nac tap nih        grave#n#2
sac|peak              green#a#1
num|morj              have#v#1|exist#v#1
mac                   hook#v#1
teak|pa|khak|la|cal   knot#n#2

Contrast. Below are minimal contrast sets for the same lect. They are unusual in contrasting full onset, nucleus, and coda segments. Hovering over any numbered cell reveals the contrasting items. On the right, we see the contrast of various consonant codas with open vowel finals.
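One way to compute such contrast sets is to hold two of the three syllable slots fixed and collect the values seen in the third. The sketch below assumes syllables are already parsed into (onset, nucleus, coda) triples, with an empty string for an open final; the function name and the sample syllables are hypothetical.

```python
from collections import defaultdict

SLOTS = {"onset": 0, "nucleus": 1, "coda": 2}


def contrast_sets(syllables, slot):
    """Group syllables that differ only in `slot`; keep frames with 2+ values."""
    index = SLOTS[slot]
    groups = defaultdict(set)
    for syllable in syllables:
        # The frame is the syllable with the contrasting slot removed.
        frame = tuple(part for i, part in enumerate(syllable) if i != index)
        groups[frame].add(syllable[index])
    return {frame: sorted(values) for frame, values in groups.items() if len(values) > 1}


# Toy syllables (hypothetical), as (onset, nucleus, coda); "" marks an open final.
syllables = [("k", "a", "p"), ("k", "a", "n"), ("k", "a", ""), ("p", "a", "n"), ("m", "i", "")]
coda_contrasts = contrast_sets(syllables, "coda")
onset_contrasts = contrast_sets(syllables, "onset")
print(coda_contrasts, onset_contrasts)
```

Including the empty coda as a value is what lets consonant codas be contrasted with open vowel finals in the same set.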


Assemble  for  download  This  provides  variations  on  tsv,  xml,  and  htm  views  in  which  more  or 
less  metadata  is  provided.  For  example,  this  is  an  htm  view  of  all  18  lects  in  this  source 
(columns  are  glosses,  rows  are  lects). 


1971 Huffman, Franklin. Unpublished vocabulary lists. Huffman Papers, sealang.net/archives/huffman

Row  copper gloss  silver gloss
1    one           one#n#1
2    two           two#n#1
3    three         three#n#1
4    four          four#n#1
5    five          five#n#1
6    six           six#n#1
7    seven         seven#n#1
8    eight         eight#n#1
9    nine          nine#n#1
10   ten           ten#n#1

Forms for each lect, in row order (lects with fewer than ten forms are listed as extracted):

1 [khm] Central Khmer: muaj, pi:, baj, buan, pram, pram|muaj, pram|pi:, pram|baj, pram buan, dap
2 [lcp] Western Lawa: ti?, la|7a, la|?ue, paon, phan, lch, ?a|kh, sle?, staem
3 [mnw] Mon: moa, ba, pae?, pan, pa sap, ka|raa, ha|pah, ha|cam, ha|cic, cah
4 [mnw] Mon: mua, ba, pac?, pan, ajsaji, a raa, ha|pah, ha|cam, ha|cic, cah
5 [cbn] Nyahkur: muaj, ba:r, pi:?, pan, c“u:n, tiaw, “pah, ^catm, ^ciit, cas
6 [mlf] Mal: ma|?an, piaw, plic?, pho:n
7 [kjg] Khmu: mo:j, ba:r
8 [kdt] Kuy: mu:j, bia, paj, pa:n, sa:rj, ta pat, ta|pbol, tajkual, ta|kch, pcat
9 [kdt] Kuy: muaj, bail, paj, pom, sa:q, ta|phat, ta|pho:l, talked, ta|khe:h, ma|cit
10 [bru] Eastern Bru: muaj, ba:r, paj, pu:n, si:rj, ta|pat, ts|pu:l, tajkual, ta|ke:h, ma|cit
11 [ngt] Ngeq: ma:c, ba:r, pc:, puan, sa:r), tha|phat, t6a|pho:l, ta|ka:l, ta|kias, ma|chit
12 [alk] Alak: mo:j, phau, paj, pom, phar 1 tham, ta|raw, tarn|pah, ta rja:m, tarj|ci:n, c^it
13 [lbo] Laven: mu:j, ba:r, pc:, puan, sa:rj, traw, pah, ^atm, cim, cet
14 [brb] Lave: mu:j, ba:r, pe:, puan, c“a:i], traw, pah, tha:m, cem, cit
15 [sti] Bulo Stieng: muaj, ba:r, pe:, puan, pram, praw, pah, pha:m, Jen, ja|niAt
16 [tpu] Tampuan: maoji, pHar, paerj, pwan, pa|tam, trao, tim|paoh, U|qa;m, ^in, Pcbjt
17 [cog] Chong: muy, pay, phe:w, pho:n, phram, ta:g, nu:j, ti:, ca:j, ra:j
18 [pcb] Pear: moj, pea:k, phek, pha:|on, phra:m, krai), “null, kra|ti, kamsear, ra:j

Sets  can  be  rotated  in  place  for  easier  browsing.  Below,  columns  are  lects  and  rows  are  glosses: 


1995 Tryon, Darrell T. (ed.) Comparative Austronesian Dictionary: An Introduction to Austronesian Studies. Berlin, New York: De Gruyter Mouton

Lects (columns): 1 [tay] Atayal, 2 [tsu] Tsou, 3 [dru] Rukai, 4 [pwn] Paiwan, 5 [tao] Yami, 6 [isd] Isnag

Row  copper gloss          silver gloss
1    world                 world#n#1
2    earth, land           land#n#4
3    earth-ground, soil    soil#n#2
4    dust                  dust#n#1
5    mud                   mud#n#1
6    sand                  sand#n#1
7    mountain, hill        mountain#n#1
8    cliff, precipice      cliff#n#1
9    plain, field          plain#n#1
10   valley                valley#n#1
11   island                island#n#1
12   mainland              mainland#n#1
13   shore                 shore#n#1
14   cave                  cave#n#1
15   water                 water#n#1
16   sea                   sea#n#1
17   calm (of sea)         calm#a#2
18   rough (of sea)        roiling#a#1
19   foam                  foam#n#1
20   ocean                 ocean#n#1
21   lake                  lake#n#1
22   gulf, bay             bay#n#1
23   lagoon                lagoon#n#1
24   reef                  reef#n#1
25   headland, point       cape#n#1
26   wave                  wave#n#1
27   tide                  tide#n#1
28   low tide              low_tide#n#1
29   high tide             high_tide#n#1

[Screenshot: the forms for each lect did not survive text extraction in a recoverable order.]

The xml and tsv views are the basis of the project's data distributions:

<dataset id="huffman1971vocabulary.c1">
  <metadata>
    <reference>
      <id>huffman1971vocabulary</id>
      <doi>15144/huffman1971vocabulary</doi>
      <creator>Huffman, Franklin</creator>
      <title>Unpublished vocabulary lists</title>
      <date>1971</date>
      <publisher>Huffman Papers, sealang.net/archives/huffman</publisher>
      <lects>18</lects>
    </reference>
    <language>
      <languageCode scheme="iso639-3">khm</languageCode>
      <languageName scheme="iso639-3">Central Khmer</languageName>
      <latLong source="Ethnologue18">12.4671,104.5699</latLong>
      <latLong source="Glottolog2.6">12.0515,105.015</latLong>
      <country source="Ethnologue18">Cambodia</country>
      <country source="Glottolog2.6">Cambodia</country>
      <adm level="1" source="Ethnologue18">Kampong Chhnang</adm>
      <adm level="1" source="Glottolog2.6">Kampong Cham Province</adm>
      <population source="Ethnologue18">14224500</population>
    </language>
    <doculect>
      <id>huffman1971vocabulary.c1</id>
      <doi>15144/huffman1971vocabulary.c1</doi>
      <creator>CRCL</creator>
      <date>2015</date>
      <notation>IPA</notation>
      <analysis>broad</analysis>
      <forms>887</forms>
    </doculect>
  </metadata>
  <data>
    <item id="huffman1971vocabulary:C:c1.r1.gs1495.i2" iso639-3="khm" lang="Central Khmer">
      <forms>
        <form status="copper" analysis="broad" script="IPA">muaj</form>
        <form status="silver">musj</form>
      </forms>
      <glosses>
        <gloss status="copper" lang="eng">one</gloss>
        <gloss status="bronze">one</gloss>
        <gloss status="silver">one#n#1</gloss>
      </glosses>
    </item>
    <item id="huffman1971vocabulary:C:c1.r2.gs1562.i19" iso639-3="khm" lang="Central Khmer">
      <forms>
        <form status="copper" analysis="broad" script="IPA">pii</form>
        <form status="silver">pi:</form>
        ...
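As an illustration of how a consumer might read this XML view, the sketch below pulls the silver form and silver gloss out of each item and writes them as tab-separated rows. It assumes the element and attribute names shown in the excerpt; the abbreviated `SAMPLE` document, its item ids, and the function name `silver_rows` are invented for the example, and a real distribution file may differ.

```python
import xml.etree.ElementTree as ET

# Abbreviated, well-formed sample modeled on the excerpt (ids and values are illustrative).
SAMPLE = """<dataset id="huffman1971vocabulary.c1">
  <data>
    <item id="r1" iso639-3="khm" lang="Central Khmer">
      <forms>
        <form status="copper" analysis="broad" script="IPA">muaj</form>
        <form status="silver">muaj</form>
      </forms>
      <glosses>
        <gloss status="copper" lang="eng">one</gloss>
        <gloss status="silver">one#n#1</gloss>
      </glosses>
    </item>
    <item id="r2" iso639-3="khm" lang="Central Khmer">
      <forms>
        <form status="copper" analysis="broad" script="IPA">pii</form>
        <form status="silver">pi:</form>
      </forms>
      <glosses>
        <gloss status="silver">two#n#1</gloss>
      </glosses>
    </item>
  </data>
</dataset>"""


def silver_rows(xml_text):
    """Return (iso, silver gloss, silver form) for every item that has both."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        form = item.find("forms/form[@status='silver']")
        gloss = item.find("glosses/gloss[@status='silver']")
        if form is None or gloss is None:
            continue  # skip items that have not reached silver grade
        rows.append((item.get("iso639-3"), (gloss.text or "").strip(), (form.text or "").strip()))
    return rows


rows = silver_rows(SAMPLE)
print("\n".join("\t".join(row) for row in rows))
```

Filtering on the `status` attribute is what makes it possible to restrict downstream work to normalized data while keeping the raw copper forms in the same file.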


Colexification attempts to identify universals related to semantic shift and inherent polysemy and heterosemy. Below, we look at Tryon Austronesian (a collection of 80 lects, here numbered), identifying all semantic pairs that are expressed with the same word in multiple languages.
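The pair-finding step just described can be sketched as follows, assuming the data is available as (lect, gloss, form) triples. The function name `colexified_pairs`, the `min_lects` parameter, and the sample forms are all hypothetical; the sketch only shows the grouping logic, not the project's implementation.

```python
from collections import defaultdict
from itertools import combinations


def colexified_pairs(entries, min_lects=2):
    """entries: iterable of (lect, gloss, form) triples.

    Returns {(gloss_a, gloss_b): [lects]} for gloss pairs that share a form
    within at least `min_lects` lects.
    """
    glosses_by_form = defaultdict(set)
    for lect, gloss, form in entries:
        glosses_by_form[(lect, form)].add(gloss)

    lects_by_pair = defaultdict(set)
    for (lect, _form), glosses in glosses_by_form.items():
        for pair in combinations(sorted(glosses), 2):
            lects_by_pair[pair].add(lect)

    return {pair: sorted(lects) for pair, lects in lects_by_pair.items() if len(lects) >= min_lects}


# Toy entries (invented forms, numbered lects as in the table).
entries = [
    (1, "die#v#1", "mate"), (1, "kill#v#1", "mate"),
    (2, "die#v#1", "matay"), (2, "kill#v#1", "matay"),
    (3, "die#v#1", "mate"), (3, "kill#v#1", "patay"),
    (2, "egg#n#2", "telu"), (2, "testicles#n#1", "telu"),
]
pairs = colexified_pairs(entries)
print(pairs)
```

Requiring a pair to recur across several lects is what separates a candidate universal from an accident of one language.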


[Screenshot: colexification table. Each row gives a pair of silver glosses and the numbers of the lects in which one word expresses both. Only the rows below survived text extraction with their lect lists attached.]

do#v#1 / work#n#1: 8|12|13|14|36|70|80
donkey#n#2 / mule#n#1: 5|18|19|29|60
doorpost#n#1 / pillar#n#5|pole#n#1: 16|17|21|33|36|47|48|60|75|79
dormitory#n#2:male / meeting_house#n#0: 41|43|45|48|49|59|60|61|63|64|78
down#r#1|below#r#1 / low#a#2: 7|14|16|40|44|55|61|62
down#r#1|below#r#1 / under#a#1: 8|9|11|12|13|14|16|18|19|20|21|22|24|29|30|31|32|33|34|35|36|43|45|49|50|51|57|62|63|71|76|77|78|80
eat#v#1 / meal#n#2: 11|19|41|51|64|65|73|74|76
edge#n#1 / side#n#1: 4|8|12|14|24|51|67|78
egg#n#2 / testicles#n#1: 1|28|34|58|63
empty#a#1 / zero#n#2|nothing#n#1: 1|19|25|35|38|43|49
end#n#1 / end#n#2: 15|20|21|32|37|50|51|52|60|62|78
end#n#2 / finish#v#1: 11|25|30|34|43|49|70
end#n#2 / last#a#2: 10|16|22|25|27|35|57
end#n#2 / stop#v#2: 43|49|52|63|69|71

Semantics. A question that arose for LORELEI applications was whether the small lexicons this project is based on would have relevant semantic content. This feature looks at various measures of sentiment as applied to the Tryon Austronesian list.
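A minimal sketch of how a sentiment measure might be applied to a gloss list: strip the part-of-speech and sense attributes from each silver gloss, look the bare word up in a lexicon, and report which glosses have no score. The tiny `lexicon` and its values are placeholders, not figures from any published resource, and the function names are assumptions for the example.

```python
def strip_attributes(silver_gloss):
    """'happy#a#1' -> ['happy']; '|' separates alternative glosses."""
    return [part.split("#", 1)[0] for part in silver_gloss.split("|")]


def score_glosses(glosses, lexicon):
    """Average the lexicon scores of each gloss's words; list glosses with none."""
    scored, missing = {}, []
    for gloss in glosses:
        values = [lexicon[word] for word in strip_attributes(gloss) if word in lexicon]
        if values:
            scored[gloss] = sum(values) / len(values)
        else:
            missing.append(gloss)
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return ranked, missing


# Placeholder polarity scores (hypothetical values).
lexicon = {"happy": 0.5, "murder": -0.75, "zero": -0.25, "nothing": -0.5}
glosses = ["happy#a#1", "murder#n#1", "zero#n#2|nothing#n#1", "doorpost#n#1"]
ranked, missing = score_glosses(glosses, lexicon)
print(ranked, missing)
```

The length of `missing` relative to the gloss list gives a direct answer to the coverage question raised above: how much of a small lexicon a given sentiment resource can score at all.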


Sentiment for tryon1995comparative gloss list. Using just items, without attributes.

SentiWords: Guerini M., Gatti L. & Turchi M. "Sentiment Analysis: How to Derive Prior Polarities from SentiWordNet". In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP'13), pp. 1259-1269. Seattle, Washington, USA. 2013. hlt.fbk.eu/technologies/sentiwords

SentiWordNet: Stefano Baccianella, Andrea Esuli, Fabrizio Sebastiani. "SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining". In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10) (May 2010). sentiwordnet.isti.cnr.it

Valence, Arousal, Dominance: Warriner A. B., Kuperman V. and Brysbaert M. (2013) "Norms of valence, arousal, and dominance for 13,915 English lemmas". Behavior Research Methods 45, pp. 1191-1207. crr.ugent.be/archives/1003

[Screenshot: ranked columns of silver glosses with their scores under each measure; the cell contents did not survive text extraction in a recoverable order.]

Coverage overview. We saw the summary overview on page 1 of this report. The detailed view first summarizes all ISO codes, then lists sources with other details one by one.


Sets  per  ISO  code  (larger  numbers  imply  language  surveys) 

NOMF  1  are  1  arn?  aH.  1  aHv  9  aH?  1  aii  1  akll 

alk  2 

ane  1 

Details  by  bibref 

ISO  items  bibref 

col 

lang 

Lavi 

anl  7 

app?  1 

atb  1 

atq  2 

ban  1 

bbc  1 

bca  1 

bhz  1 

bje  2 

bkz  1 

1217 

tryon  1 995comparative 

17 

Acehnese 

bit  1 

bmt  2 

bod  2 

bpn  2 

bps  1 

brb  1 

bru  2 

brv.1 

bug  2 

bwx  2 

bxd  1 

bzh?  1 

cam  1 

cbn  1 

cek  10 

ege  1 

ckn  2 

clj:3 

elk:  1 

clt  8 

acn 

1676 

huang1992tbl 

huang1992tbl 

huang1992tbl 

28 

Ac  hang 
Adi 

Amdo  Tibetan 

cmr  10 

cth  1 

cmw  2 

czt  1 

cmyl 
dad  1 

cnb  9 
dao  26 

cng:1 
ddg  1 

enh  1 

dis  1 

c<>g ' 

dru  1 

cqd:1 
dup  1 

esh  19 

duu  1 

csv  8 
enu  1 

adi 

adx 

1770 

1345 

24 

5 

ero  1 

ers:  1 

«)  i 

gil:1 

gor:2 

get'  i 

hea  2 

hit:  8 

hmd  1 

hmi  1 

adx 

1702 

huang1992tbl 

4 

Amdo  Tibetan 

hmj  1 

hml  2 

hmm  2 

hnn  1 

how  1 

huj  1 

in  2 

ind  2 

irh:  1 

in  1 

adz 

984 

tryon  1995comparative 

49 

Adzera 

isd:1 

lum  4 

jae  1 

jav  1 

jeh  1 

JIU  1 

jmn  2 

jya  1 

kac:1 

kaf  1 

1276 

tryonl  995comparative 
tryon  1995comparative 

66 

A'jie 

kdt  7 

kem  1 

kqc  1 

kgd  1 

khb  2 

khm  1 

kij  1 

kix  1 

kic:1 

kjg 

—  - 

kill 

kmk  1 

ksd  1 

ksw  1 

ktv.1 

kuf  3 

kvo  1 

kwd  1 

kzf.1 

Ibo  3 

692 

huffman  1 971  vocabulary 
tryon  1995comparative 

12 

lep  1 

lew  1 

Ihu  1 

lid  1 

lis  1 

llu  1 

Ipn  1 

Isi  1 

lus  1 

Iww  1 

ane 

1069 

67 

Xaracuu 

Izn  7 

mad  1 

mah  1 

mak  1 

mdh  1 

mdr  1 

mek  1 

meu  1 

mhs  1 

mhu  1 

anl 

451 

Ism2015chin 

122 

Anu-Hkongso  Chin 

rnhxl 

min:1 

mji  3 

mjw  1 

mkz  1 

mlf  1 

mmr  2 

mna  1 

mm2 

mnw  2 

anl 

451 

Ism2015chin 

123 

Anu-Hkongso  Chin 

mrh  1 

mxe  1 

mm  1 
mxj:1 

mro  3 
my  a  1 

mva  1 

nbe  1 

mvm  1 

nbi  2 

mw  1 

nem:1 

anl 

452 

Ism2015chm 

118 

Anu-Hkongso  Chin 

mwt  1 

mww  1 

nbu  2 

nen  1 

anl 

454 

Ism2015chm 

119 

Anu-Hkongso  Chin 
Anu-Hkongso  Chin 

ngt  4 

njb  1 

njh  1 

njm  2 

njn  1 

njo  3 

nkh  1 

nki  1 

nlql 

nme  1 

anl 

nmf  1 

nmy  1 

nn9  1 

nnl  1 

nnp  1 

nod  1 

nPg  3 

nPh  1 

npo  1 

npy  1 

458 

121 

Anu-Hkongso  Chin 
Raga 
Zaiwa 

nqq2 

nqy  i 

nre  1 

nn  1 

nsa  1 

nsm  1 

nst  51 

ntx:3 

nuf  1 

nun  1 

OPP7 

atb 

670 

tryon  1995comparative 
huang1992tbl 

58 

nut  1 

nxa  1 

nxg  1 

nxk  2 

nxq  1 

nzm  1 

oog  1 

ors  1 

pac  1 

peb  1 

1843 

30 

pha  2 

pit:  1 

plw  2 

pma  1 

pmf  1 

pmi  1 

pmj  1 

pnu  2 

pon  1 

ppk  1 

atq 

626 

arnaud1997lexique 

16 

Aralle-Tabulahan 

pss  1 

psw  1 

ptt  1 

pwm  1| 

pwn  1 

pyu  1 

pzn  4 

qvy  1 

rap  1 

rog:1 

alq 

663 

arnaud1997lexique 

15 

Aralle-Tabulahan 

rtc  4 

rtm  1 

rug  1 

sas  1 

sda  1 

sez  8 

shn  2 

shx  1 

skb  1 

ski:  1 

ban 

1190 

tryon  1 995comparative 

24 

Balinese 

smo  1 

ssb  2 

sse  1 

sss  4 

sti  1 

sun  1 

sxg  1 

szw  1 

tab  1 

tao  2 

bbc 

1199 

tryon  1 995comparative 

18 

Toba  Batak 

huang1992tbl 

tnn  1 

ton  1 

tpu  1 

tsj  1 

tsu  1 

tttl  1 

tto  2 

tts  1 

twh  1 

twm  1 

bhz 

800 

arnaud1997lexique 

9 

Bada  (Indonesia) 

bje 

358 

wang1995miao 

22 

BiaoUiao  Mien 

twu  1 

umn  4 

weu  4 

wew  1 

wlo  1 

woe  1 

wtw  1 

wyy  1 

xct  2 

ycl:1 

bje 

397 

wang1995miao 

21 

Biao-Jiao  Mien 

yim  1 
zhn  2 

ysnl 

zlj2 

ywq  1 
zln  1 

zch  1 

zeh  3 

zgb  6 
zyg2 

zgn  3 
zyj  1 

zha  7 

zhb  1 

zhd  1 

bkz 

901 

arnaud1997lexique 

29 

Bungku 

zqe  1 

zyb4 

zyn  5 

zyp  1 

zzj4 

blt

968 

hudak2008comparative 

3 

Tai  Dam 

bmt 

400 

ratliff2010language 

10 

Biao  Mon 

bmt 

434 

wang1995miao 

18 

Biao  Mon 

c„  rx'C 

1900 

huang1992tbl 

3 

Tibetan 

64 
[Figure: the (1) Build the data universe panel. Controls: Source specification (bibref; data content / quality, "silver or better only"); Linguistic spec (ISO, relations, family, analysis, e.g. Ethno 18); Geographic spec (lat.long, ADM name, country / area); Proximity (ignore, country, ADM-1, ADM-2, ADM-3, or a mile radius around a given ISO code or lat.long).]


This page begins to develop the underlying
functionality that will be required by more
conventional dictionary applications. It takes
an unconventional approach that is necessitated
partly by the very large amount of data
we provide access to, and partly by our
anticipation of LORELEI’s specific needs, in
particular the ability to focus or extend queries
by region and relations.

(1)  Build  the  data  universe  In  effect,  this  step 
instantiates  the  dataset  we  wish  to  query.  By 
default,  queries  are  limited  to  silver-grade 
normalized  datasets. 


[Figure: the remaining query panels: (2) Filter the data (semantic and phonological), (3) Frame the data, Search, and (4) Process & view. Each panel is shown in close-up in the sections that follow.]


Linguistic  spec  defines  the  universe  in  terms
of  ISO  639-3  codes,  language  family  names 
(e.g.  AA/AN/HM/KD/ST),  or  their  analyzed 
phylogenetic  relations,  in  the  language  family 
tree.  Analyses  vary;  we  support  Ethnologue, 
Glottolog,  and  some  local  subgroupings. 

Geographic  spec  provides  some  means  of
defining,  limiting,  or  extending  a  search.  This 
is  very  helpful  in  regions  with  high  language 
densities  and  mutual  influence. 

(2)  Filter  the  data  These  provide  what  is 
ordinarily  the  semantic  or  phonological  query. 
We  are  currently  focused  on  facilities  for 
semantic  fallback;  these  are  demonstrated
below.  The  phonological  search  facility  is 
limited  at  present. 

(3)  Frame  the  data  Most  of  our  knowledge 
about  languages  is  actually  external  to  the 
original  data  sources.  Framing  lets  us  add 
lect-specific  facts  to  the  returned  forms  and 
glosses,  typically  to  aid  in  downstream 
applications  (e.g.  projection  onto  a  map). 


(4) Process & view Returned data will vary dramatically in size (from one item to
thousands) and intended function. Beyond obvious alternatives of map or tabular views,
we may wish to pass results to downstream applications (like our own apps in /cogs,
discussed below). Again, we stress that these tools are not intended to produce a
user-facing dictionary, but rather to help us instantiate and visualize this
low-level functionality.


Build  the  data  universe  We  can  limit  or  extend  the  search  universe  by  sources,  phylogenetic 
linguistic  specification,  or  geographical  bounds  /  regions.  This  is  important  in  areas  for  which 
data  is  limited  because  it  lets  queries  fall  back  to  languages  that  are  related,  or  which  are  likely  to 
be  loan  sources.  Below,  we  show  associated  dropdown  lists. 
[Figure: the Build the data universe panel with two dropdown lists open. The country / area list offers named regions (mainland SEA, MSEA+China, insular SEA, insular Asia-Pacific, mainland Asia-Pacific, ISEA+PNG+Taiwan, trans-Himalayas, sub-Himalayas, NE Asia, South Asia, Oceania) followed by the countries within each region (for mainland SEA: Cambodia, Laos, Myanmar, Thailand, Viet Nam). The relations list offers no relations, sisters, 1st cousins, 2nd cousins, and 3rd cousins.]


Now, the ISO 639-3 standard only specifies language names and three-letter codes.
Information regarding phylogenetic subgrouping and speaker location must be provided
by an external analysis. We track both Glottolog and Ethnologue, the only wide-scale
analyses available.
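The relations setting (no relations, sisters, cousins) can be read as a walk up the subgroup path supplied by the external analysis. The following is a minimal sketch of that idea, not the project's implementation: the `PATHS` table, the `expand` function, and the level numbering are assumptions introduced here, and the subgroup paths are illustrative rather than authoritative.

```python
from typing import Dict, List, Set

# Illustrative subgroup paths (root first). Real paths would come from an
# external analysis such as Ethnologue or Glottolog; these are assumptions.
PATHS: Dict[str, List[str]] = {
    "sou": ["Tai-Kadai", "Kam-Tai", "Tai", "Southwestern"],
    "tha": ["Tai-Kadai", "Kam-Tai", "Tai", "Southwestern"],
    "lao": ["Tai-Kadai", "Kam-Tai", "Tai", "Southwestern"],
    "zch": ["Tai-Kadai", "Kam-Tai", "Tai", "Northern"],
    "kmc": ["Tai-Kadai", "Kam-Tai", "Kam-Sui"],
}


def expand(iso: str, level: int, paths: Dict[str, List[str]] = PATHS) -> Set[str]:
    """Expand one ISO code to its relatives.

    level 0 = no relations, 1 = sisters (same parent node),
    2 = 1st cousins (same grandparent node), and so on.
    """
    if level == 0:
        return {iso}
    path = paths[iso]
    # Each extra level drops one more node from the end of the path.
    keep = max(len(path) - (level - 1), 0)
    prefix = path[:keep]
    return {code for code, p in paths.items() if p[:keep] == prefix}


print(sorted(expand("sou", 1)))
print(sorted(expand("sou", 2)))
```

Because the two analyses group languages differently, the same code and level can yield different universes depending on which analysis supplies the paths.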


The  graphic  below  is  produced  by  the  reference  tools  widget  on  the  far  right  of  the  /dict
page;  it  shows  the  functionality  underlying  the  data  universe  specification.  The  user  enters  the 
first  few  letters  of  a  language  code  or  name;  we  identify  the  proper  code,  then  show  geographic 
and  subgrouping  data  as  available.  Note  that  these  are  by  no  means  always  in  agreement  - 
analyses  and  even  locations  may  vary  considerably. 


ISO  639-3  code  /  name  lookup 
[sou  Southern  Thai _ 

check  [  clear 


115.8  miles  (186.4  KM)  between  Glottolog  and  Ethnologue  reference  lat/long. 

Showing  names,  branches,  available  ADMs,  and  nearest  populated  place  (per  GeoNames). 


Glottolog  2.6 

sou  Southern  Thai 

Subgroup:  Tai-Kadai,  Kam-Tai,  Be-Tai,  Daic,  Central-Southwestern  Tai,  Wenma-Southwestern  Tai,
Sapa-Southwestern  Tai,  Southwestern  Tai,  Southwestern  Thai  PH,  Lao-Thai

Sisters::lao  |  tts  |  sou  |  tha 

Country:  Thailand 

PPL:  Ban  Laem  Khae  (1.1  miles) 


Ethnologue  18 

sou  Southern  Thai 

Subgroup:  Tai-Kadai,  Kam-Tai,  Tai,  Southwestern 

Sisters::aho  |  aio  |  bit  |  cuu  |  khb  |  kht  |  kkh  |  ksu  |  lao  |  nod  |  nyw  |  pdi  |  phk  |  pht  |  phu  |  puk  |  shn  | 
soa  |  sou  |  tdd  |  tha  |  the  |  thi  |  tiz  |  tjl  |  tmm  |  tts  |  twh  |  tyl  |  tyr  |  tys  |  tyt  |  yno 
Country:  Thailand 

ADM-1 :  Changwat  Nakhon  Si  Thammarat 
PPL:  Ban  Khlong  Chai  Tai  (1.4  miles) 

Lat/long figures given by these sources are a useful fiction that approximates a speaker-population "center" (national
languages often use the capital). Place names occasionally cannot be found because the point is over water. The nearest
populated place serves as a proxy for exact locations; ADM-2 and lesser boundary values are not always available.


Because  LORELEI-related  responders  are  likely  to  be  working  with  local  civil  authorities,  we 
have  gone  to  some  lengths  to  attempt  to  identify  speaker  neighborhoods  and  enclosing  regions  in 
terms  of  formal  ADM  identifiers  (and  vice  versa). 
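The distance reported between two reference points, and the mile-radius proximity option, both come down to a great-circle distance between lat/long pairs. A minimal sketch using the haversine formula follows; the function names, the sample coordinates, and the lect labels are hypothetical, and the Earth radius is a mean-radius approximation.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation
KM_PER_MILE = 1.609344


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two lat/long points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_radius(center, points, miles):
    """Return labels of all points within `miles` of `center` (lat, lon)."""
    return [
        label
        for label, (lat, lon) in points.items()
        if haversine_miles(center[0], center[1], lat, lon) <= miles
    ]


# Hypothetical reference points, not taken from Glottolog or Ethnologue.
points = {"lect_a": (8.5, 99.0), "lect_b": (12.0, 99.0)}
print(within_radius((8.0, 99.0), points, 50))
```

The same miles-to-kilometres factor reproduces the conversion shown in the lookup widget (115.8 miles is about 186.4 km).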
Filter the data This is what we ordinarily think of as formulating the query. Below left, we
query strike. Part-of-speech can be specified, either to restrict a word sense, or to serve as a filter
in place of any particular gloss (e.g. we might request all kin terms).

Fallback controls semantic expansion. At present, options include derivs (English derived
forms, e.g. “striking”, “striker”), the MetaGloss synonym set or cluster (semantically equivalent
or related terms), or strict WordNet synonym sets. The extend to raw glosses option looks for the
(possibly expanded) search term in the raw, copper gloss form as well as the normalized silver (or
gold-standard) form. One consequence of expanding semantic targets is that a single lect may
have multiple hits. Normally, we suppress secondary items: if the initial search form is found,
expanded items are suppressed. Inclusive display returns all items.
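The difference between the default (exclusive) behaviour and inclusive display can be sketched as follows. This is an illustration under stated assumptions: the `search` function, the `(iso, gloss, form)` tuple layout, and the placeholder forms are hypothetical and do not reflect the actual database schema.

```python
from typing import Dict, Iterable, List, Set, Tuple

Entry = Tuple[str, str, str]  # (iso, gloss, form); a hypothetical layout


def search(
    entries: Iterable[Entry],
    target: str,
    fallbacks: Set[str],
    inclusive: bool = False,
) -> Dict[str, List[Entry]]:
    """Return hits per lect, falling back only where the target is missing.

    With inclusive=True, fallback hits are returned for every lect,
    even when the target gloss itself was found.
    """
    entries = list(entries)
    hits: Dict[str, List[Entry]] = {}
    primary: Set[str] = set()
    # Pass 1: exact matches on the target gloss.
    for entry in entries:
        if entry[1] == target:
            hits.setdefault(entry[0], []).append(entry)
            primary.add(entry[0])
    # Pass 2: fallback glosses, suppressed for lects that already matched.
    for entry in entries:
        iso, gloss, _ = entry
        if gloss in fallbacks and (inclusive or iso not in primary):
            hits.setdefault(iso, []).append(entry)
    return hits


# Placeholder forms; these are not attested data.
entries = [
    ("tha", "strike#v#1", "form_a"),
    ("tha", "pound#v#1", "form_b"),
    ("shn", "pound#v#1", "form_c"),
]
print(search(entries, "strike#v#1", {"pound#v#1", "hit#v#3"}))
print(search(entries, "strike#v#1", {"pound#v#1", "hit#v#3"}, inclusive=True))
```

In the exclusive case the first lect keeps only its exact match while the second falls back; in the inclusive case the first lect returns both items.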


As noted, phonological query options are limited at this point; available options include the
ability to ignore syllable boundaries, and to treat raised items (which usually represent features or
secondary phonemes) as though they were ordinary letters.
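Both options amount to normalizing a form before comparison. The sketch below is one possible reading, not the project's code: it assumes syllable boundaries are written with "." or "-", and it uses Unicode NFKC normalization as a stand-in for "treat raised items as ordinary letters" (NFKC also folds other compatibility characters, which may or may not be wanted).

```python
import unicodedata

SYLLABLE_MARKS = ".-"  # assumed boundary symbols; real data may use others


def normalize_form(form: str, ignore_syllables: bool = True, flatten_raised: bool = True) -> str:
    """Normalize a transcribed form for loose phonological matching."""
    if flatten_raised:
        # NFKC maps superscript letters and digits to their plain counterparts.
        form = unicodedata.normalize("NFKC", form)
    if ignore_syllables:
        form = "".join(ch for ch in form if ch not in SYLLABLE_MARKS)
    return form


print(normalize_form("t\u02b0ok\u2074\u2075"))  # raised h and raised tone digits
print(normalize_form("ta.pla"))
```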


[Figure: the (2) Filter the data panel. Semantic: final gloss strike, part of speech verb; fallback options none, derivs, MG syn, MG cluster (selected), WN; extend to raw glosses checked; inclusive display unchecked. Phonological: final form, raw form; ignore options for syllables, raising, and phonation.]


The  result  of  this  search  is  shown  below. 


Limiting  datasets  to  silver  or  better  (uncheck  silver  or  better  for  higher/broader  test  volume): 

huffman1971vocabulary  |  huffman1979vocabulary  |  theraphan2001languages_1  |  theraphan2001languages_2  |  hudak2008comparative  |  zhang1999zhuang  |  huan
Seeking  gloss  strike.  Will  fall  back  to  raw  glosses  for  any  unmatched  ISO  slot. 

Members  of  branch  KD  are: 

aho  |  aih  |  aio  |  aou  |  bit  |  by  k  |  cdy  |  cov  |  cuq  |  cuu  |  doc  |  enc  |  giq  |  gir  |  giu  |  gi  w  |  gqu  |  j  io  |  khb  |  kht  |  kkh  |  kmc  |  ksu  |  kyp  |  lao  |  laq  |  lbc  |  lbt  |  lha  |  lie  1 1  wh  |mkg|mlc|mlm|mmd|nod| 
Falling  back  to  MetaGloss  cluster  snakebite#n#1  |  v@snakebite#n#1  |  strike#v#0  |  hit#v#3  |  strike#v#1  |  deal_a_blow#v#0  |  pound#v#1  |  pound_on#v#0
Falling  back  to  raw  glosses.  These  won't  be  displayed  unless  gloss  show  raw  is  checked. 

Reduced  to  134  entries  after  fallback  check  of  copper  glosses. 

Found  7  gloss  forms  (number  includes  raw  gloss  variants)  for  23  languages. 

Mouse  over  WordNet  item  for  sense  gloss,  double-click  to  look  up  base  word.  Double-click  form  for  source  lect  sketch. 


ISO 

ISO  639-3 

name 

family 

to  beat,  pound 

hudak2008comparairve 

beat^-^  | 
pounds-1*  1  (xl) 

to  hit  the  mark 

zhangl 999zhuang 

to  hit,  play,  etc.  (in 
phrases) 

hudak200Scomparatt\e 

to  pound 

hudak2008comparathe 

to  pound;to 
pestle 

zhangl 999zhuang 

to  strike  repeatedly, 
with  a  short  quick 
motion 

hudak2  OOScomparatne 

to  strike,  as  a 
snake 

hudak2008comparaii\e 

pip«v=2  (x51) 

pound=v«8  (xl) 

pestle=v£l:nce 

(x3<5) 

strike=v=  1  (x9) 

strike#\#l  (xl) 

strike^v*!)  (xl) 

[bit] 

Ta«  Dam 

Tai-Kadai 

tok45 

[khb] 

Lu 

Tai-Kadai 

tok45 

tok55 

[nod] 

Northern  Thai 

Tai-Kadai 

tup454 

tam14 

[nut] 

Nung  (Viet 
Nam) 

Tai-Kadai 

trk55 

[shn] 

Shan 

Tai-Kadai 

txuk55 

Jak55 

t3tu=21 

[skb] 

Saek 

Tai-Kadai 

nuk454 

[lha] 

Thai 

Tai-Kadai 

tok22 

[Its] 

Northeastern 

Thai 

Tai-Kadai 

tok24 

(fivhj 

Tai  Don 

Tai-Kadai 

tok45 

[zch] 

Central 

Hongshuihe 

Zhuang 

Tai-Kadai 

terj4: 

to:j33 

tarn42 

[zeh] 

Eastern 

Hongshuihe 

Zhuang 

Tai-Kadai 

ter)*5 

te  t)45  . 

toj55 

tsor)53 

rMin54 _ 

tam35 

tam45.:! 
Frame  the  data

As  noted  above,  the  lexicon  per  se  provides  very  little  information  about  the  language. 
Additional  lect-specific  information  is  usually  required  by  downstream  applications.  The  few 
choices  allowed  here  are  mainly  for  testing. 

Similarly,  most  of  the  Search  panel’s  controls  are  there  to  allow  testing. 
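Framing can be pictured as a join between returned rows and a table of lect-specific facts. The sketch below assumes a simple dictionary layout; the `frame` function, the field names, and the placeholder form are hypothetical, while the two language names and family labels are those shown in the result tables.

```python
from typing import Dict, Iterable, List

# Lect-specific facts held outside the lexical sources (layout is an assumption).
LECT_FACTS: Dict[str, Dict[str, str]] = {
    "shn": {"name": "Shan", "family": "Tai-Kadai"},
    "nod": {"name": "Northern Thai", "family": "Tai-Kadai"},
}


def frame(rows: Iterable[dict], facts: Dict[str, dict], fields: List[str]) -> List[dict]:
    """Attach the requested lect-level fields to each returned row."""
    framed = []
    for row in rows:
        extra = facts.get(row["iso"], {})
        # Missing facts are passed through as None rather than dropped.
        framed.append({**row, **{field: extra.get(field) for field in fields}})
    return framed


rows = [{"iso": "shn", "gloss": "bone#n#1", "form": "form_a"}]  # placeholder form
print(frame(rows, LECT_FACTS, ["name", "family"]))
```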


[Figure: the (3) Frame the data panel (distance, name, family, branch, altitude, speaker count) and the Search panel (Search, reset all; verbose, chatty; return as map, table, check, +forms, report, or analyze).]

We  have  already  seen  a  table  return.  A  map  view  is  shown  below.  The  map  control  is  shown  as 
an  inset.  Here,  we  see  words  for  bone#n#1  drawn  from  all  five  language  families.  The  buttons
in  the  control  allow  more  detailed  displays  by  country,  language  family,  or  additional  query 
terms.  These  are  based  on  framing  data  that  was  passed  through  with  the  lexical  data. 


[Figure: map view of forms for bone#n#1 (269 languages), with the map control shown as an inset. Control buttons filter by country (for example China 66, Indonesia 40, Myanmar 32, India 30) and by language family: Austro-Asiatic (29), Austronesian (105), Hmong-Mien (19), Sino-Tibetan (93), Tai-Kadai (23). Hovering over a point shows place names.]


Process  and  view  The  last  set  of  options 
provides  more  detailed  control  of  the  display. 
The  scale  of  returned  results  varies  enormously 
-  both  the  lect  and  semantic  axes  may  have 
from  one  to  hundreds  of  items  each  -  so  our 
main  concern  is  making  very  large  data  views 
manageable.


[Figure: the (4) Process & view panel. Map options: names only, embed data, simplify data. Tabulate options: sort by gloss or ISO (3,2,1 or a,b,c) or by branch; rows are glosses, ISOs, or automatic; show raw and/or final glosses and forms; collapse unused WN glosses; hide | bounds and dupes (optionally showing counts); double-clicking a form shows sketches or metadata.]


/cogs

This page is our working tool for building cognate sets.

[Figure: the /cogs page. Panels: New cognate sets (query search; table or xml view; family AA, AN, HM, KD, ST, or all; min; save); Legacy cognate data (query by etygloss; sources per family: HM Ratliff; KD Hudak, Joe, Weera; AA Shorto; AN ABVD, Wolff, ACD, Zorc; ST STEDT); Fallback overviews (MG cluster, inclusive, search raw; tables or clusters); Find cognates (prefix, suffix, and character suppression; cutoff; ignore affixes; show clusters, entries, legacy IDs; clustering method standard with in-group distance 0.20 and delta 0.02, or MCL); and Show MetaGloss counts (family and sort order).]

A  huang1992tbl  (x2545)

B  lsm2015chin  (x468)

C  lsm2015naga  (x484)

D  marrison1967classification  (x908)

Click words below to pre-load the search interface above with partial
or fully qualified WordNet / MetaGloss search terms. Blue items are
in the ABVD 210 list; asterisk * items are in the LSM (similar to
MSEA) list. All single-family single-word distances have been
pre-calculated, as well as many (but not all) of the full MetaGloss
fallback sets.


w  +  nl|w  +  n|  w  ||  rice#n#l 

731 

A 

175 

B 

365 

C 

130 

D 

61 

633 

A 

153 

B 

239 

C 

149 

D 

92 

517 

A 

223 

B 

152 

C 

77 

D 

65 

It  II  t  A 

w  +  pl|w  +  p|  w  II  we#p#l 

510 

A 

90 

B 

242 

C 

150 

D 

28 

w  +  n3|w  +  n|  w  ||  paddy#n#3* 

473 

A 

47 

B 

239 

C 

149 

75 

D 

D 

38 

77 

w  +  ql|w  +  q|  w  ||  when#q#l 

471 

A_ 

172 

B 

147 

C 

w  +  vl|w  +  v|  w  ||  CUt#V#l 

444 

— 

A 

98 

B 

271 

C 

75 

D 

443 

A 

49 

B 

252 

C 

103 

D 

39 

442 

A 

35 

B 

241 

C 

144 

D 

22 

w  +  al|w  +  a|  w  ||  hot#a#l 

429 

A 

37 

B 

250 

C 

123 

D 

19 

w+nl|w+n|  w  ||  leech#n#l 

428 

A 

175 

B 

142 

C 

77 

D 

34 

407 

A 

134 

B 

151 

C 

76 

D 

46 

398 

A 

148 

B 

119 

C 

77 

D 

54 

397 

A 

94 

B 

266 

C 

7 

D 

30 

w  +  al |w  +  a|  w  ||  near#a#l* 

397 

A 

136 

B 

146 

C 

76 

D 

39 

w4-n2|w+n|  w  ||  chicken#n#2* 

391 

A 

99 

B 

144 

C 

76 

D 

72 

New  cognate  sets  Shows  sets  in 
progress.  These  can  be  restricted  by 
family,  or  by  number  of  families 
represented  by  a  given  etygloss  (the 
cognate  set’s  working  name). 

Legacy  cognate  data  This  provides 
access  to  our  database  of  existing 
comparative  and  proto-language 
reconstruction  data.  These  help 
identify  and  provide  support  for  new 
cognate  sets. 

Fallback  overviews  Raw  glossing  is 
often  imprecise;  even  when 
unambiguous,  semantics  tend  to  drift 
over  time.  Thus,  almost  every  new 
cognate  set  includes  items  drawn  from
subsets  with  distinct  raw  and 
normalized  glosses.  These  overview 
tools  help  us  get  a  sense  of  how 
broadly  to  cast  our  initial  net  in 
searching  for  relevant  cognates. 

Find  cognates  A  search  for 
potential  cognates  is  initiated  by  one 
or  more  semantic  queries,  usually 
requesting  automatic  inclusion  of 
related  fallback  items.  A
phonological  distance  measure  is
then  calculated  for  all  returned 
forms,  and  they  are  clustered  into 
potential  cognate  groups.  The 
mechanisms  by  which  distance  is 
measured,  and  items  are  then 
clustered,  are  both  highly 
configurable.  Optimal  settings  are 
difficult  to  predict,  and  are  heavily 
influenced  by  language  typology. 
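The two configurable stages, distance measurement and clustering, can be sketched as below. This is a stand-in, not the measure or clustering used by the project: it assumes a length-normalized Levenshtein distance over raw characters and single-linkage grouping under a cutoff. The function names and the cutoff value are hypothetical; the sample forms are taken from the Austronesian ash sets listed in this report.

```python
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def distance(a: str, b: str) -> float:
    """Edit distance scaled to 0..1 by the longer form's length."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))


def cluster(forms: List[str], cutoff: float) -> List[List[str]]:
    """Single-linkage clustering: join any two forms within the cutoff."""
    parent = list(range(len(forms)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(forms)):
        for j in range(i + 1, len(forms)):
            if distance(forms[i], forms[j]) <= cutoff:
                parent[find(i)] = find(j)
    groups: Dict[int, List[str]] = {}
    for i, form in enumerate(forms):
        groups.setdefault(find(i), []).append(form)
    return list(groups.values())


forms = ["avu", "abu", "awu", "ahukesan", "ahuklesan", "ahikdesan"]
print(cluster(forms, cutoff=0.34))
```

A character-level measure like this ignores phonetic similarity between segments, which is one reason optimal settings depend so heavily on language typology.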

Show  MetaGloss  counts 


Construction of cognate sets proceeds methodically through the lexicon. At this early stage, we
give preference to semantics that are found in as many lects as possible. Some of the very high
figures seen here are an artifact of our MetaGlossing methodology: we favor base:modifier
metaglosses, because the base generally establishes the proper cognate set.


We note in passing that calculating phonological distance between all word pairs, and
clustering subgroups within the resultant distance tables, are both computationally
quite expensive. Thus, we pre-calculate and cache a huge number of distances (including all
predictable fallbacks) and candidate clusters (based on a half-dozen different clustering settings).
New cognate sets This provides views of the current state of cognate set assembly. The xml
view is saved and distributed. Below, note that the EtyGloss names the set, and the Refs indicate
what the original full glosses were (these are used under find cognates, discussed below). The
individual cluster names refer to citations from the literature when possible (e.g.
AA:ash#n#1:S2034, KD:ash#n#1:W119), and are otherwise simply numbered (AN:ash#n#1.3).
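The naming scheme for etyglosses and clusters is regular enough to parse mechanically. The sketch below infers a grammar from the examples in this section (family, then lemma#pos#sense, then either a literature citation after a colon or a serial number after a dot); the `ClusterName` type, the `parse` function, and the exact pattern are assumptions, not a published specification.

```python
import re
from typing import NamedTuple, Optional


class ClusterName(NamedTuple):
    family: str              # e.g. AA, AN, HM, KD, ST
    lemma: str               # e.g. ash
    pos: str                 # one-letter part of speech, e.g. n
    sense: int               # sense number
    citation: Optional[str]  # literature reference, e.g. S2034
    serial: Optional[int]    # local number when no citation exists


PATTERN = re.compile(
    r"^(?P<family>[A-Z]{2}):(?P<lemma>[^#:]+)#(?P<pos>[a-z])#(?P<sense>\d+)"
    r"(?::(?P<citation>[A-Za-z]\d+(?:\.\d+)?)|\.(?P<serial>\d+))?$"
)


def parse(name: str) -> ClusterName:
    """Split an etygloss or cluster name into its parts."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized cluster name: {name!r}")
    serial = match["serial"]
    return ClusterName(
        match["family"],
        match["lemma"],
        match["pos"],
        int(match["sense"]),
        match["citation"],
        int(serial) if serial is not None else None,
    )


print(parse("AA:ash#n#1:S2034"))
print(parse("AN:ash#n#1.3"))
```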


Overlap  (EtyGloss  in  blue)  for  all  families:  2  etyglosses  for  6  families  33  etyglosses  for  5  families  59  etyglosses  for  4  families  283  etyglosses  for  3  families  560  etyglosses  for  2  families  2513  etyglosses  for  1  families 
(AA:215  AN:156  HM:518  KD:256  ST:128  ) 


EtyGloss:  AA:ash#n#1

Refs:  ash#n#1  (x43)  |  dust#n#1  (x41)  |  ash#n#1:field  (x3)

AA:ash#n#l:S2034  ::  pkeh  |  boh  |  phi:h  |  torpo:h  |  buh  |  buh  |  mahaw  |  poh  |  ca?  |  pko?  |  bo:h  |  bah  |  bah  |  bo:h  |  pka?  |  boh  |  p^nv  |  ?aboh  |  ?aboh  |  ?aboh  |  cabuh 


EtyGloss:  AN:ash#n#1
Refs:  ash#n#1  (x118)

AN:ash#n#l:A146.1  ::  avu  |  abu  |  abca  |  fu:  |  a0u  |  abo  |  afii  |  avo  |  umu  |  eraba  |  kaw  |  gahuwej  |  awu  |  aw  |  aw:u  |  wao  |  awuk  |  wahu  |  ahu  |  rehu  |  rohu  |  gabu  |  qavu  |  kaboj  |  kabu  |  refu  |  raJJu  |  labu  |  zepu  |  ?aboh  |  abuh 
|  avubara  |  avu  nu  kaju  |  taj  hapu  |  taj  ahu  |  kahu  |  d_aj3usa:  |  qaJJuii? 

AN:ash#n#1.3  ::  "dap  |  "dep33  |  "dc
AN:ash#n#1.2  ::  ahikdesan  |  ahukesan  |  ahuklesan
AN:ash#n#1.5  ::  tajaw  |  tadjaw
AN:ash#n#1:A146.11  ::  afuafu  |  efuefu
AN:ash#n#1:A146.30  ::  vulimolas
AN:ash#n#1:A146.8  ::  feraga
AN:ash#n#1:A146.35  ::  makola
AN:ash#n#1:A146.7  ::  pe:s
AN:ash#n#1:A146.17  ::  dapog
AN:ash#n#1:A146.37  ::  jaj  taen
AN:ash#n#1:A146.3  ::  rapo  rabuka


EtyGloss:  HM:ash#n#1
Refs:  ash#n#1  (x34)

HM:ash#n#l:R538  ::  |  qku3  |  swaj3  |  saj3  |  ci3  |  t§kaw3  |  so3  “  |  tskuB  |  0e3  |  cwaj3  |  sa:j3  |  chu35  |  sa43  |  s^a33  |  ce31  |  0e53  |  si33  |  sag52  |  s'va53  |  qwaj53  |  0waj53  |  saj545  |  qi44  |  sa:j53  |  swaj35  |  qi35  |  saj24  |  tskow55  |  ts"aw55  |  su13  |  sko13  | 
so232  |  tsku55  |  qow53 

EtyGloss:  KD:ash#n#1

Refs:  ash#n#1:plant  (x38)  |  ash#n#1  (x18)

KD:ash#n#l:W119  ::  tkaw41  |  taw31  |  taw31  |  taw41  |  d'aw52  |  taw13  |  taw13  |  taw454  |  taw44  |  taw33  |  taw33  |  d'aw44  |  taw35  |  taw22  |  daw11  |  taw213  |  taw11  |  taw312  |  taw21  |  taw42  |  taw53  |  taw214  |  daw42  |  saw33 
KD:ash#n#l:P213  ::  piaw11  |  p’aw11  |  p^aw11  |  p’aw21 


EtyGloss:  ST:ash#n#1

Refs:  ash#n#1  (x283)  |  ash#n#1:plant  (x48)

ST:ash#n#1.5  ::  kik  |  kuk  |  “kuk  |  ku:k  |  “kudc 
ST:ash#n#1.4  ::  pighit  |  pumhi  |  purnhm  |  panhat  |  p unhat 

ST:ash#n#l:M5606  ::  pkelo  |  tkeklo  |  hatlA  |  tapla  |  tepla  |  tapla  |  thapla  |  a31  pla53  |  pla53  |  lua35  |  la35  |  lya55  |  b  |  lo  |  tap  la  |  tap51  la4  |  kku21  fa33  |  qkou  Id33  |  qo21  la35  |  la  tap 
ST:ash#n#l:M374  ::  go  tkal  |  gogtkal  |  tkalba 

ST:ash#n#l:M3514  ::  labu  |  la:bu:  |  mbut  |  vut  |  aot  |  ut  |  opu  |  m6ut  |  “put  |  “vut  |  “vit  |  ta  “vut  |  k1^  “vut  |  ku:kvit  |  ku:k  “vut  |  vitpot  |  vutpot  |  vat  tap  |  vajvat  |  vajvit  |  wajwut  |  wajwit  |  nivit  |  ra6a  |  labu  |  la:bu  |  la4  po2 
|  la2  bu2  |  maj  pku?  |  majpku  |  6aj  p°u  |  bajpku  |  pkajpku  |  majpko  |  maj  pku  |  wut  |  wit  |  *vit  |  vit  |  vut  |  wit  |  pu21  tski35  |  vui?  |  ha  bu  |  put  li4  |  bat  li3  |  bo33  tcki33  |  haj  pku  |  li2  pko?  |  lipko?  |  mvut  Jkkuj  |  m6ut  kkuj  |  “put  kkkuj 


Legacy cognate data We rely on and refer to existing analyses whenever possible, using a
separately constructed database of reconstructed proto-forms and comparative sets. In this
implementation, these can be queried by semantic gloss.


HM 

Level 

Gloss 

Reeon 

Fonns 

HM:228 

pHM 

dle*v»l 

ta* 

tai*  |  te*  |  tua*  |  to*  |  taa*  |  Aa° 

KD 

Level 

Gloss 

Reeon 

Forms 

KD:  11346 

pTai 

dle*v»1 

A2 

huv1  |  praay1  |  taav1  taa\~  |  taay*  |  thaav1 

KD 

Level 

Gloss 

Reeon 

Forms 

KD:P704 

pTal 

dle*v*l 

‘P-tatJ* 

ha.jA1  i  pra:J41  |  p*a:J41  |  ta:J*‘  t'1a:j41 

Kl) 

Level 

(doss 

Reeon 

Forms 

KD:W26S 

pKu 

die«v«1 

•pY’on* 

pen41  |  phf*‘  |  phan*1  |  puM*5 

AA 

(il»n 

Reron 

Forma 

AA:  1266a.  A 

to  die,  be 
extinguished 

'jap 

pain-jup  |  ponu>ap-[hata] 

AA:  1766a.  B 

to  die,  be 
extinguished 

•Jaap 

paqj  |  pjm-ijucp  |  paji^p  |  pamajiip 

AA.987A 

to  die 

•ketajt 

dial  |  kacet  |  kacecl  [  kaciit  |  slit  |  keel  | 
keet  |  chat  |  khchet  |  harm  |  poem  | 
paceet  |  kajiciit  |  kaslit  j  chot  |  sat  |  ea:t  | 
cult  I  kaclal  |  kacit  |  suit  |  cuat  |  diet  | 
hacot  |  (k)cnl  |  (ga)vat  |  'cert  |  'keirt 

AA:1218.A 

to  die 

’haan 

phaan  [  him  phaui  pha:n 

ST 

Reeon 

Level 

Proto  gloss  group 
(status) 

Morphemes  (duplicate  Items  are  suppressed).  These  are  raw,  unnormallred  forms. 

ST:27 

'say 

PTB 

DIE  (ok) 

chi  |  t£hi  |  chi  l  rhu  |  chi  |  chil  |  d  |  c*  |  di  |  hai  |  hi  |  hXi  |  k*JU  |  li  |  ntc**  ]  ri  |  se  |  sei“  |  *ejA  |  set  |  se*1  xeJ  |  se*1 

she  |  shel  |  shl  |  shV*  |  sh(l  |  shl  |  shl  |  she  |  si  |  si:  1  sla”  |  sle  |  sih  |  sill  |  slk11  |  sll  |  sit  |  sly  |  sly3  |  sty1  |  si’  |  si”  |  si”  | 

siu  j  si:  |  slat  |  si4 1  si*  |  slM  |  si14  |  sjhi  |  sjhy  |  sjidx  |  so  |  suh  |  suh:  |  sy  |  sya  1  syi  |  syid  |  syiy(?)  |  syi2 1  syy  |  si”  |  siu 
|  si  i  sih  |  a  |  stl  |  si  |  si  |  se  |  sa  |  saj  |  sat  say  |  say1  \  sa2  |  sa1*  |  san  |  sa”  sa”  |  saM  |  sc  |  sv”  !  st(”  |  si  |  si  si  |  si“ 

j  sll  |  sui  |  suit  |  sth  |  sni“  |  s)33  |  s)sl  |  *)**  |  s)u  |  sj“  |  s)“  |  s\“  |  s*rl  |  M  |  tchhi  |  the5 1  thl  |  thi,  |  thi  |  thi  J  th*y  |  tl  | 

tik  |  tsi  |  tstu  |  tdi  |  tajr  |  tvi  |  tjhs;  |  t*i  |  t*ay  |  t0e2 1  tOe”  |  t0f“  |  »  |  xe?"  |  xui“  |  zi  |  ii  |  |  Si  |  5b  |  Si  |  51 1  $  |  Si  | 

it  |  ci  |  ci55  |  ciu”  |  d”  |  cl”  |  d”  |  ci”  |  d”  |  cfc  |  ci4J  |  d”  |  cl**  |  co  |  ca“  |  «**  |  ci”  |  fid  1 1*  |  fi2 1  ji22  |  ji”  |  fi” 

1  fiK  1  ?o“  1  ju”  |  pi  i  ?c  |  fiu”  |  ftni  |  a”  |  n”  |  a“  1  fl”  1  1  flu  1  Je3 1  Jet4 1  Jed"  |  JeK  |  J1 1 JI33  |  Jih‘  | 

Jik”  1  n2  |  Jl”  |  Ji”  1  a”  1  ji4: 1  jl44  1  fi*!  1 »  1  (>(?)  1  Jsi”  1  fl4  1  Jl”  1  Jl,s  1  fl”  1  ft44 1  Jr  1  U"  1  c-i  |  eei*J  i  0eJ  |  0e'  |  01 

I  ei41 1  ef  1 01 1  I  01** 

ST:  29 

•svou. 

PTB 

DIE  /  SPIRIT  OF  DEAD  /  NON-VLABLE 
(ok) 

xwan  |  s6n 
Fallback overviews

Before attempting to identify cognate set members, we usually need to get a sense of how the
members are glossed, and how much semantic variation must be dealt with. Below, a search for
beat#v#3 falls back to similar semantics (pound#v#1), and looks for beat in raw glosses.
Inclusive search allows overlap; an exclusive search only falls back when the target isn’t found.

On the right, pre-clustered forms make it easy to spot probable semantic shift. Items with
identical glosses are subgrouped by the similarity of their forms; this does not guarantee that they
are cognates, but given that they have the exact same meaning it is highly likely. Once clusters
are formed, it is fairly easy to eyeball similar groups with slightly different semantics. Again,
there is no guarantee that they are cognate (or that we have found all possible cognates).
However, this process helps us recognize the best starting point for the find cognates step.


Limiting datasets to silver or better:

huffman1971vocabulary | huffman1979vocabulary | theraphan2001languages_1 | theraphan2001languages_2 | hudak2008comp

Uncheck silver or better for higher/broader test volume

Seeking gloss beat#v#3. Will fall back to raw glosses for any unmatched ISO slot.

Members  of  branch  kd  are: 

alio|aili|aio|aou|yha|byk|pcr|mlc|cov|zcli|rdy|cuq|<hd|zrh|rr«*|yxg|vnr|giu|giq|zgn|xgb|lic|jio|khl|luiu|kkli|ujin|lbt|llui|l 
318 entries on hand

Falling back to MetaGloss cluster beat#v#2|pummel#v#1|beat#v#3|batter#v#2
Falling back to raw glosses. These won't be displayed unless gloss show raw is checked.

Reduced to 102 entries after fallback check of copper glosses.

Found 6 gloss forms (number includes raw gloss variants) for 23 languages.

Mouse over WordNet item for sense gloss, double-click to look up base word. Double-click form for source lect sketch.


ISO 

ISO  639  3 

family 

to  flog 

:hmt!999:ka*xs 

to  beat  a  drum 
.  kmgJ  M  hag 

to  beat  .to  fight 
:hotg:999!Wmt 

j 

to  beat  pound 

hammer,  to  beat 
Hdbr.'Wre'yaear*  t 

MIS' 

bean»v*3  tuck 

vm 

Jr  urn"'.  »2 

(*M) 

bcat*v*2  (xM) 

beat*-.  »3  <x!0) 

beat»v*3  | 
pound*V«l  (*9) 

hananei*val  ,*i) 

[bill 

Tai  Dare 

Tai-Kadai 

ti“ 

[kkb] 

La 

Ini  Knd.ii 

ti* 

b" 

tup" 

[■<*!] 

Northern 

That 

Tai  Kadai 

b 14 

tup444 

[■■0 

Xung  (Viet 
Nam) 

Tu-Kadai 

th:*4 

tup*4 

!«*>■) 

Shan 

Tai-Kadai 

t.“ 

ti*4 

tup31 

.Kq.43 

[vkb] 

Sack 

Tai  Kadai 

t^ap33 

Thai 

Tai-Kadai 

b:“ 

!*up" 

[««*] 

Xortheattem 

Thai 

Tai  Kadai 

ti" 

tSip44 

[ink] 

Tar  Don 

Tai  Kadai 

n- 

tup44 

XJO31 

[ich] 

Central 

Hongshurhe 

Tai-Kadai 

map13 

ta“ 

l«fc] 

Hongahuihe 

Zhuang 

Tm-Xadn! 

men43 

mop3’ 

mop" 

3nokM* 

ho" 

tup31 

tup31 

tup" 

l**b) 

Guitxri 

Zhuang 

Tai  Kadai 

lag331 

lit33 

map1* 

map,; 

mop34 

bo  o’4 
bu  n34 

JO33... 

VT 

lag-'" 

talk".. 

Limiting datasets to silver or better

huffman1971vocabulary | huffman1979vocabulary | theraphan2001languages_1 | theraphan2001languages_2 | hudak2008compara
Uncheck silver or better for higher/broader test volume

Checking semantic fallbacks for beat#v#3. Below, a|b = more or less equivalent, a b = close grouping, a :: b = distinct sets.
Derivatives: beat|beats|beating|beaten
MetaGloss syn(s): beat#v#3|batter#v#2

MetaGloss cluster(s): beat#v#2|pummel#v#1 beat#v#3|batter#v#2
WordNet syn(s):

The  I  forms  service  only  recognizes  family  requests  from  sllvcr-or-better  sources  for  now.  It  provide  a  quick  overview  of  (possibly)  fdl 
word-forms  glossed  with  (possibly)  related  semantics.  Only  one  instance  of  any  form  for  any  MetaGloss  family  combination  is  shown, 
compaiison,  forms  are  shown  in  numWird.  phonologicallv  similar  clusters,  with  line  breaks  altri  large:  groups  (rtretliiKl  sIaiulard.  aigs:< 
milage  will  vary). 

Searching  for  all  of  the  MetaGlosses  identified  above,  allowing  only  these  sources:  hudak2008eomparati ve  |  zhang  1 999zhuang 
The  potential  searrh  list  Includes  MetaGloss  syns ,  MetaGloss  clusters  (which  are  still  being  whittled  down),  and  WordNet  synsets: 
bcat#[vit]  #3 1 batters’ [vit]  »2|bcat*(vit]  *’2|pummel*’[vit]  #1. 

Our  search  list  only  includes  items  used  In  the  complete  dataset: . 


beat#V#2  KD  (xrrt  a  btchnf  to  mjoftft  to  a  btabn&  nthtr  m  a  punuivnmt  or  a»  an  art  of  qgrmwa  Thuf  bmthm  141  when  fa  mt&ni  dms  (hr  jtrrW  late  at  wjfit";  T 
la  boat  da  ahjdmb') 

1  dxp“ ,  tup11 ,  tup51 ,  tup” ,  tup”  ,  tup* ,  tup15 ,  tup”  ,  tu»p"  .  ttep”  ,  tyq>”  ,  tap”  ,  tap51 ,  tap”  ,  tap”  ,  tuip”  ,  t’hirp* 

2  map3’ ,  mop3’  3  dag”  ,  pag”  4  doej13  ,  duaj13  S  kuk” 


beat  *  V  *  3  K  I)  lJuf  uprated!*-.  tool  an  da  Soar";  irat  tfa  tahlr  tndi  ha  doa") 

1  U”  ,  0*  ,  tl.-** ,  U:u  ,  ti:14 ,  It4*  ,  tl:** ,  II4*4  ,  t*fc‘4 

beat *v*  3: stick  KD  (fat  Tp***?;  <W.  W  da  toU>  mth  ha 

1  map”  ,  map* 1’ ,  map1’ ,  mop” ,  mop13 ,  mop’4 ,  mop’3 ,  mop* ,  mop3* ,  mop13 ,  mup* 

2  hon”  ,  hon31  ,  bon33 ,  bon*: ,  horn4’  ,  bun”  ,  bun4’ ,  horn13 ,  ban” 

3  dap"  ,  tup13 ,  tup44  ,  tuq>”  ,  tap13  ,  t*u:p“ 

4  mak”  ,  mak”  ,  mok5‘ ,  ’node” 

5  dag"  ,  noarj”  .  paay"  6  worn"  ,  wraan*" ,  wort"  7  ?a:tn""  ,  Taan’14 ,  Varm1*  8  mam’1 ,  mem’1 ,  mem4*  9  dosj"  ,  duj,: 

das)1*  1 0  fa:k* ,  fok31 ,  fork*  1 1  lam"1 ,  but"  12k"a:g"  13  lag"1 


bra («v#3|clap#  V«3  KD  (hit  laptatrdfy;  an  da  Soar';  tool  thr  lahit  lailfi  ha  that'  |  (iqpaw'i  handi  or  thaut  afttr  ptrfamanm  to  indicate  apprcnxxt) 

1  tup“ 

brat#v4>3|poiind«V#l  KD  (ial  rrprofrdJj;  Src*  an  da  Soar';  da  foUr  ha  thor'  fat  hand  »sdl  da  hard,  fat,  or  umt  faavy  inxtrummc  1hr  ml* 
knariur';  ‘a  t*Ur4humpoig  Stxjdam  Baptat") 

1  top31 ,  tup" ,  tup1* ,  tup44 ,  tup4*4 ,  tuq>"  ,  t*ap"  ,  t*op"  ,  t*“up44 ,  t*up"  ,  t*up“ 


Find cognates. Below is a close-up of the menu that sets up the query. In some cases, data will be pre-assigned to likely cognate sets. Thus, in addition to the ordinary semantic fallback options, we can also expand match items to other elements of the same cognate set, even if they have different semantics (source EtySets), and to other elements with the same divergent semantics (semantic xrefs).
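
The expansion options described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the system's actual code: the `Entry` record, the `expand_matches` function, its parameter names, and the toy data (including the set IDs and the `flea#n#1` gloss) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    form: str
    lect: str
    metagloss: str
    etyset: Optional[str] = None  # pre-assigned cognate set ID, if any


def expand_matches(entries, query, fallbacks=(), etysets=False, xrefs=False):
    """Return entries matching the query, optionally expanded (hypothetical logic)."""
    wanted = {query, *fallbacks}
    # Ordinary match: the query MetaGloss plus any semantic fallbacks.
    result = [e for e in entries if e.metagloss in wanted]
    # Members of the same cognate sets that carry different semantics.
    set_ids = {e.etyset for e in result if e.etyset is not None}
    relatives = [e for e in entries
                 if e.etyset in set_ids and e.metagloss not in wanted]
    if etysets:
        result.extend(relatives)
    if xrefs:
        # Other entries that share those divergent semantics.
        divergent = {e.metagloss for e in relatives}
        extra = [e for e in entries
                 if e.metagloss in divergent and e not in result]
        result.extend(extra)
    return result


# Toy data for illustration only; not taken from the project datasets.
TOY = [
    Entry("kutu", "lect_a", "louse#n#1", "set1"),
    Entry("utu", "lect_b", "louse#n#1:head", "set1"),
    Entry("caj", "lect_c", "louse#n#1", "set2"),
    Entry("xxx", "lect_d", "flea#n#1", "set2"),
    Entry("yyy", "lect_e", "flea#n#1"),
]
```

With the toy data, `expand_matches(TOY, "louse#n#1", etysets=True)` adds the two entries that share a cognate set with a direct match, and setting `xrefs=True` as well adds the remaining `flea#n#1` entry that belongs to no cognate set.
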


[Screenshot: the Find cognates query menu, with the query louse#n#1 entered. Controls shown: include MG fallbacks, source etySets, semantic xrefs; prefix, suffix, cutoff, ignore affixes, char suppress; show clusters or entries; include raw gloss; clustering method; build clusters; legacy IDs; rebuild cached.]


Once elements are identified, we assess their phonological distance and cluster the closest elements. Depending on language typology, the cognate morpheme might not typically be a free lexeme; for example, some languages might tie it to a class term that means “fruit” or “animal”. To obtain more accurate distance measures, we provide mechanisms for suppressing part of the returned forms, either by specifying an affix to ignore or by assessing each lect’s complete word list and inferring likely affixes.
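
A rough sketch of the affix suppression and distance steps is given here. It rests on several assumptions that are not from the report: prefixes are inferred by simple frequency over a lect's word list, plain normalized edit distance stands in for the phonological distance measure, and the function names, the `min_share` threshold, and the toy word list (with a made-up class prefix `mak`) are all hypothetical.

```python
from collections import Counter


def infer_prefixes(word_list, length=3, min_share=0.5):
    """Guess class-term prefixes: initial strings shared by many words."""
    if not word_list:
        return set()
    counts = Counter(w[:length] for w in word_list if len(w) > length)
    return {p for p, c in counts.items() if c / len(word_list) >= min_share}


def strip_prefix(form, prefixes):
    """Remove the longest matching prefix, never leaving an empty form."""
    for p in sorted(prefixes, key=len, reverse=True):
        if form.startswith(p) and len(form) > len(p):
            return form[len(p):]
    return form


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance scaled to the range 0..1 by the longer form."""
    if not a and not b:
        return 0.0
    return edit_distance(a, b) / max(len(a), len(b))


# Toy word list for one hypothetical lect; "mak" plays the role of a class term.
WORDS = ["makkutu", "makbua", "makmuang", "nam", "din"]
PREFIXES = infer_prefixes(WORDS)
```

Comparing `makkutu` directly with `kutu` gives a distance of 3/7, while comparing `strip_prefix("makkutu", PREFIXES)` with `kutu` gives 0, which is the kind of improvement the suppression mechanism is meant to deliver.
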

Clustering methods are also configurable. This implementation allows two types: a bottom-up agglomerative tree-building approach that is bounded by the maximum distance between any two items, and Markov Chain clustering, which can be more effective for properly clustering items from relatively continuous dialect chains.
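
The bounded bottom-up approach can be illustrated with a small sketch. It assumes that "bounded by the maximum distance between any two items" corresponds to complete-linkage merging with a threshold, and it uses normalized edit distance as a stand-in for the phonological distance; both choices, along with the function names and the `max_dist` parameter, are assumptions for illustration. The Markov Chain method is not sketched here.

```python
from itertools import combinations


def normalized_distance(a, b):
    """Levenshtein distance divided by the length of the longer form."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    longest = max(len(a), len(b))
    return prev[-1] / longest if longest else 0.0


def complete_linkage(c1, c2, dist):
    """Distance between clusters: the largest distance between any two members."""
    return max(dist(a, b) for a in c1 for b in c2)


def agglomerate(forms, dist, max_dist):
    """Merge the closest clusters until no merge stays within max_dist."""
    clusters = [[f] for f in forms]
    while len(clusters) > 1:
        i, j, d = min(
            ((i, j, complete_linkage(clusters[i], clusters[j], dist))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda t: t[2],
        )
        if d > max_dist:
            break  # any further merge would exceed the bound
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters


# Toy forms for illustration only.
FORMS = ["kutu", "kuto", "gutu", "caj", "cej", "hik"]
TIGHT = agglomerate(FORMS, normalized_distance, 0.3)
LOOSE = agglomerate(FORMS, normalized_distance, 0.5)
```

Because the bound applies to the most distant pair in a merged cluster, `gutu` joins `kutu` and `kuto` only when `max_dist` is raised to 0.5, the distance between `kuto` and `gutu`.
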

Below, we see the result of a search in all five families for louse#n#1 and its MetaGloss fallbacks. On the left, each item is shown with source and language information, the raw gloss, the proposed cluster, and any additional information that could be derived from the legacy cognate data discussed above.

On the right, each alternative semantic is colored differently; this is helpful in assessing likely cognate status. It will not be obvious, but in this case each of the clusters on the right naturally falls into a language family-specific grouping: 1/AN, 2/AN, 3/HM, etc.


[Screenshot: search results for louse#n#1 and its MetaGloss fallbacks across the AA, KD, HM, AN, and ST families. Left panel: one row per item, giving source (for example tryon1995comparative or huffman1971vocabulary), country, ISO 639-3 code, family, MetaGloss, form, proposed cluster, and legacy cognate set (for example AN:108.1 or AA:39.A). The legend reports 667 matched MetaGloss items: louse#n#1:head (x402), louse#n#1 (x195), and louse#n#1:body (x70). Right panel: numbered clusters of forms under EtyGloss AA|KD|HM|AN|ST:louse.]